author | Adam Spragg <adam@spra.gg> | 2022-05-17 10:44:27 +0100 |
---|---|---|
committer | Adam Spragg <adam@spra.gg> | 2022-05-18 17:19:40 +0100 |
commit | 588b3e780fef14631f7ec5c369bff07efdc5e013 (patch) | |
tree | 5874b387da5e6f055abf70597c19cfaefe1760bb | |
parent | 96df5969b11b9a64f95c0c28347154b06cfc9d15 (diff) |
Add documentation for new Format 1
-rw-r--r-- | FILEFORMAT_1 | 88 |
-rw-r--r-- | Makefile | 1 |
-rw-r--r-- | README | 3 |
-rw-r--r-- | man1/metastore.1 | 40 |
-rw-r--r-- | metastore.txt | 35 |
5 files changed, 164 insertions, 3 deletions
diff --git a/FILEFORMAT_1 b/FILEFORMAT_1
new file mode 100644
index 0000000..b16d85a
--- /dev/null
+++ b/FILEFORMAT_1
@@ -0,0 +1,88 @@
+Version 1
+---------
+
+Following sections explain internals of metastore file (.metadata), version 1
+
+
+### Data types
+
+    SIGNATURE = Magic signature, 10 bytes long = "MeTaSt00r3"
+    VERSION   = Format version string, 8 bytes long. This version = "00000001"
+    URLSTRING = URL-encoded string. Chars 0x00 to 0x20 (inclusive), 0x7F, and
+                0x25 (%) *must* be encoded. Terminated by any character which
+                must be encoded, which is not (e.g. "\t", "\n").
+    INTSTRING = ASCII-encoded integer, in a pre-specified base from 2 (binary)
+                to 16 (hexadecimal). May be preceded by "-" for negative values
+
+### File layout
+
+    SIGNATURE VERSION "\n"
+    n * (ENTRY "\n")
+
+
+### ENTRY format
+
+    URLSTRING      - Path (absolute or relative)
+    "\t" URLSTRING - Owner (owner name, not uid)
+    "\t" URLSTRING - Group (group name, not gid)
+    "\t" INTSTRING - Mode (base 8, of struct stat.st_mode & 0177777.
+                     i.e. File type and mode, as per inode(7)
+    "\t" URLSTRING - Mtime (including nanoseconds) in ISO-8601 format, UTC.
+                     "YYYY-mm-ddTHH:MM:SS.nnnnnnnnnZ"
+
+    m * ("\t" URLSTRING "\t" URLSTRING)
+                   - xattr name/value pairs. `m` may be 0.
+
+    "\n"           - Entry-terminating newline.
+
+
+### Discussion
+
+This format is designed to work with version control systems, specifically
+`git(1)`.
+
+To fit in with `git` and its related tooling, this format is a line-based text
+file. Each record is a bunch of text fields separated by tabs, terminated by a
+newline. This means records should be identifiable and somewhat understandable
+to readers, and should work with `diff(1)` and `patch(1)` (and their `git`ified
+descendants). Merge conflicts should produce files that are resolvable with any
+ordinary text editor. (Even `ed(1)`, if you insist!)
+
+This format is generally slightly larger than Format 0, but shouldn't be
+significantly so for most use cases, and this is a reasonable trade-off for
+readability and diff/patch/merge-ability.
+
+The format could be significantly larger if files have large amounts of binary
+data in xattrs, as 35 out of the 256 possible bytes require URL-encoding as a
+3-byte sequence, giving a 27.3% increase (by my calculations). This clearly
+isn't ideal, but this author suspects that the proportion of files with large
+binary xattrs is fairly small, and this should not cause an issue in practice.
+
+If a user does have large amounts of binary xattr data but can't handle the 27%
+size increase this format incurs, they can still use Format 0 to store it
+instead. If *you* have large amounts of binary xattr data that you have to store
+in git in a way that's diff/patch/merge-able - well, feel free to submit patches
+for Format 2 yourself ;-)
+
+If you do update this format, remember to change the man page as well as this
+document! I've tried to keep the info in the man page as short as possible, and
+to only include what a user should need to work with the resulting files.
+Extended musings and notes for implementors go here (or in the `git commit`
+log :-)
+
+
+### UTF-8 cleanliness
+
+Note that because bytes >= 0x80 are not required to be URL-encoded, binary xattr
+data is very unlikely to be UTF-8 clean. If this is a problem for the editor
+you use to resolve conflicts... I dunno. Get a better editor maybe? We could
+URL-encode all high bytes, but that would triple the size of half the bytes in
+binary data, and of all non-ASCII byte sequences in UTF-8 text. I suppose it
+might be possible to URL-encode all sequences of high bytes that are *not* UTF-8
+clean (and that would be backwards-compatible with the existing format) but I
+don't want to add that much complexity at the moment. Also, it might not be
+"enough" as you'd probably want to encode non-printable UTF-8 control codes
+such as RTL/LTR marks (U+200E/U+200F) to prevent the possibility of "Trojan
+Source" type attacks.
+
+(See <https://lwn.net/Articles/874951/> for more info on "Trojan Source")
@@ -27,6 +27,7 @@ UNAME_S := $(shell uname -s)
 DOCS := \
 	AUTHORS \
 	FILEFORMAT_0 \
+	FILEFORMAT_1 \
 	LICENSE.GPLv2 \
 	NEWS \
 	README \
@@ -44,7 +44,8 @@ Dump action can be really helpful in such cases.
 File format
 -----------
 
-See FILEFORMAT_0 file, which describes internals of metastore file.
+See FILEFORMAT_0 and FILEFORMAT_1 files, which describes internals of metastore
+file versions.
 
 
 Requirements
diff --git a/man1/metastore.1 b/man1/metastore.1
index d788161..1e01daf 100644
--- a/man1/metastore.1
+++ b/man1/metastore.1
@@ -83,6 +83,46 @@ ensure that the stored metadata is interpreted correctly.
 .B 0
 The original and default format, it is a compact binary representation of the
 file metadata stored.
+.TP
+.B 1
+This format is a tab-separated, line-based text representation of the file
+metadata stored. Is is not as compact as Format \fB0\fR, but is designed to
+integrate with text-based version control mechanisms, like diffs, patches,
+merges, and conflicts.
+
+After the signature/version header line, the format is:
+
+.EX
+<path> <owner> <group> <mode> <mtime> [<xattr name> <xattr value> ...]
+.EE
+
+Where
+.I owner
+and
+.I group
+are names, not ids,
+.I mode
+is the octal representation of the 16-bit "file type and mode" field described
+in
+.BR inode (7),
+and
+.I mtime
+is the ISO-8601 extended format representation of the last-modified time in UTC,
+with nanosecond precision.
+
+Strings are URL-encoded, and all characters from 0x00 to 0x20 (inclusive), 0x25
+(%) and 0x7F \fBmust\fR be encoded.
+
+As mentioned above, the format is primarily designed to be compatible with
+version control tools. It is secondarily designed to be mostly-readable by
+humans like you, because humans use those tools. It is \fInot\fR specifically
+designed to be written by humans. In the case of merge conflicts that require
+intervention it is recommended that you pick one existing version of an entry,
+rather than trying to edit one of your own with aspects of both. (Or manually
+reset the file permissions, and re-generate the metainfo file.) Particularly,
+because bytes >= 0x80 are not URL-encoded, binary xattr data probably won't be
+UTF-8 clean, so you may have a hard time doing anything other than deleting
+unwanted lines with many editors.
 .\"
 .SH AUTHORS
 metastore was created by David Härdeman in 2007-2008.
diff --git a/metastore.txt b/metastore.txt
index f3d7d55..39fdf5e 100644
--- a/metastore.txt
+++ b/metastore.txt
@@ -87,9 +87,40 @@ FORMATS
        0      The original and default format, it is a compact binary repre‐
               sentation of the file metadata stored.
 
+       1      This format is a tab-separated, line-based text representation
+              of the file metadata stored. Is is not as compact as Format 0,
+              but is designed to integrate with text-based version control
+              mechanisms, like diffs, patches, merges, and conflicts.
+
+              After the signature/version header line, the format is:
+
+              <path> <owner> <group> <mode> <mtime> [<xattr name> <xattr value> ...]
+
+              Where owner and group are names, not ids, mode is the octal rep‐
+              resentation of the 16-bit "file type and mode" field described
+              in inode(7), and mtime is the ISO-8601 extended format represen‐
+              tation of the last-modified time in UTC, with nanosecond preci‐
+              sion.
+
+              Strings are URL-encoded, and all characters from 0x00 to 0x20
+              (inclusive), 0x25 (%) and 0x7F must be encoded.
+
+              As mentioned above, the format is primarily designed to be com‐
+              patible with version control tools. It is secondarily designed
+              to be mostly-readable by humans like you, because humans use
+              those tools. It is not specifically designed to be written by
+              humans. In the case of merge conflicts that require intervention
+              it is recommended that you pick one existing version of an en‐
+              try, rather than trying to edit one of your own with aspects of
+              both. (Or manually reset the file permissions, and re-generate
+              the metainfo file.) Particularly, because bytes >= 0x80 are not
+              URL-encoded, binary xattr data probably won't be UTF-8 clean, so
+              you may have a hard time doing anything other than deleting un‐
+              wanted lines with many editors.
+
 AUTHORS
-       metastore was created by David Härdeman in 2007-2008. Now it is main‐
-       tained by Przemysław Pawełczyk. All source code contributors are
+       metastore was created by David Härdeman in 2007-2008. Now it is main‐
+       tained by Przemysław Pawełczyk. All source code contributors are
        listed in the AUTHORS file.