author Adam Spragg <adam@spra.gg> 2022-05-17 10:44:27 +0100
committer Adam Spragg <adam@spra.gg> 2022-05-18 17:19:40 +0100
commit 588b3e780fef14631f7ec5c369bff07efdc5e013
tree 5874b387da5e6f055abf70597c19cfaefe1760bb
parent 96df5969b11b9a64f95c0c28347154b06cfc9d15
Add documentation for new Format 1
-rw-r--r-- FILEFORMAT_1 88
-rw-r--r-- Makefile 1
-rw-r--r-- README 3
-rw-r--r-- man1/metastore.1 40
-rw-r--r-- metastore.txt 35
5 files changed, 164 insertions, 3 deletions
diff --git a/FILEFORMAT_1 b/FILEFORMAT_1
new file mode 100644
index 0000000..b16d85a
--- /dev/null
+++ b/FILEFORMAT_1
@@ -0,0 +1,88 @@
+Version 1
+---------
+
+The following sections explain the internals of the metastore file (.metadata), version 1.
+
+
+### Data types
+
+ SIGNATURE = Magic signature, 10 bytes long = "MeTaSt00r3"
+ VERSION = Format version string, 8 bytes long. This version = "00000001"
+ URLSTRING = URL-encoded string. Chars 0x00 to 0x20 (inclusive), 0x7F, and
+ 0x25 (%) *must* be encoded. Terminated by the first character which
+ must be encoded but appears unencoded (e.g. "\t", "\n").
+ INTSTRING = ASCII-encoded integer, in a pre-specified base from 2 (binary)
+ to 16 (hexadecimal). May be preceded by "-" for negative values.
+
+### File layout
+
+ SIGNATURE VERSION "\n"
+ n * (ENTRY "\n")
+
+
+### ENTRY format
+
+ URLSTRING - Path (absolute or relative)
+ "\t" URLSTRING - Owner (owner name, not uid)
+ "\t" URLSTRING - Group (group name, not gid)
+ "\t" INTSTRING - Mode (base 8, of struct stat.st_mode & 0177777,
+ i.e. file type and mode, as per inode(7))
+ "\t" URLSTRING - Mtime (including nanoseconds) in ISO-8601 format, UTC.
+ "YYYY-mm-ddTHH:MM:SS.nnnnnnnnnZ"
+
+ m * ("\t" URLSTRING "\t" URLSTRING)
+ - xattr name/value pairs. `m` may be 0.
+
+ "\n" - Entry-terminating newline.
+
+
+### Discussion
+
+This format is designed to work with version control systems, specifically
+`git(1)`.
+
+To fit in with `git` and its related tooling, this format is a line-based text
+file. Each record is a bunch of text fields separated by tabs, terminated by a
+newline. This means records should be identifiable and somewhat understandable
+to readers, and should work with `diff(1)` and `patch(1)` (and their `git`ified
+descendants). Merge conflicts should produce files that are resolvable with any
+ordinary text editor. (Even `ed(1)`, if you insist!)
+
+This format is generally slightly larger than Format 0, but shouldn't be
+significantly so for most use cases, and this is a reasonable trade-off for
+readability and diff/patch/merge-ability.
+
+The format could be significantly larger if files have large amounts of binary
+data in xattrs, as 35 out of the 256 possible bytes require URL-encoding as a
+3-byte sequence, giving a 27.3% increase (by my calculations). This clearly
+isn't ideal, but this author suspects that the proportion of files with large
+binary xattrs is fairly small, and this should not cause an issue in practice.
+
+If a user does have large amounts of binary xattr data but can't handle the 27%
+size increase this format incurs, they can still use Format 0 to store it
+instead. If *you* have large amounts of binary xattr data that you have to store
+in git in a way that's diff/patch/merge-able - well, feel free to submit patches
+for Format 2 yourself ;-)
+
+If you do update this format, remember to change the man page as well as this
+document! I've tried to keep the info in the man page as short as possible, and
+to only include what a user should need to work with the resulting files.
+Extended musings and notes for implementors go here (or in the `git commit`
+log :-)
+
+
+### UTF-8 cleanliness
+
+Note that because bytes >= 0x80 are not required to be URL-encoded, binary xattr
+data is very unlikely to be UTF-8 clean. If this is a problem for the editor
+you use to resolve conflicts... I dunno. Get a better editor maybe? We could
+URL-encode all high bytes, but that would triple the size of half the bytes in
+binary data, and of all non-ASCII byte sequences in UTF-8 text. I suppose it
+might be possible to URL-encode all sequences of high bytes that are *not* UTF-8
+clean (and that would be backwards-compatible with the existing format) but I
+don't want to add that much complexity at the moment. Also, it might not be
+"enough" as you'd probably want to encode non-printable UTF-8 control codes
+such as RTL/LTR marks (U+200E/U+200F) to prevent the possibility of "Trojan
+Source" type attacks.
+
+(See <https://lwn.net/Articles/874951/> for more info on "Trojan Source")
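
As a rough illustration of the URLSTRING rules and the size estimate in the FILEFORMAT_1 discussion above, here is a minimal Python sketch. It is not metastore's own (C) implementation. The names `MUST_ENCODE`, `url_encode`, `url_decode` and `expected_overhead` are made up for this example, and emitting uppercase hex digits is an assumption, since the format description does not say which case to use.

```python
# Bytes that FILEFORMAT_1 says *must* be encoded: 0x00-0x20, 0x25 (%), 0x7F.
MUST_ENCODE = frozenset(range(0x00, 0x21)) | {0x25, 0x7F}
HEX_DIGITS = b"0123456789abcdefABCDEF"


def url_encode(raw: bytes) -> bytes:
    """Encode only the bytes the format requires; bytes >= 0x80 pass through."""
    out = bytearray()
    for byte in raw:
        if byte in MUST_ENCODE:
            out += b"%%%02X" % byte  # assumption: uppercase hex digits
        else:
            out.append(byte)
    return bytes(out)


def url_decode(encoded: bytes) -> bytes:
    """Reverse url_encode(); raises ValueError on a malformed %XX escape."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == 0x25:  # '%' starts a three-byte escape
            pair = encoded[i + 1:i + 3]
            if len(pair) != 2 or any(c not in HEX_DIGITS for c in pair):
                raise ValueError("malformed escape at offset %d" % i)
            out.append(int(pair, 16))
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)


def expected_overhead() -> float:
    """Average size increase for uniformly random bytes, as a fraction."""
    escaped = len(MUST_ENCODE)            # 35 byte values grow to 3 bytes
    untouched = 256 - escaped             # 221 byte values stay at 1 byte
    return (escaped * 3 + untouched) / 256 - 1
```

For uniformly random binary data this gives (35 * 3 + 221) / 256 = 326 / 256, which matches the 27.3% increase quoted in the discussion. Text-like data with few control characters or spaces would grow much less.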
diff --git a/Makefile b/Makefile
index 0c8bdd3..62c310c 100644
--- a/Makefile
+++ b/Makefile
@@ -27,6 +27,7 @@ UNAME_S := $(shell uname -s)
DOCS := \
AUTHORS \
FILEFORMAT_0 \
+ FILEFORMAT_1 \
LICENSE.GPLv2 \
NEWS \
README \
diff --git a/README b/README
index 907cca7..19d85c7 100644
--- a/README
+++ b/README
@@ -44,7 +44,8 @@ Dump action can be really helpful in such cases.
File format
-----------
-See FILEFORMAT_0 file, which describes internals of metastore file.
+See FILEFORMAT_0 and FILEFORMAT_1 files, which describe the internals of the
+metastore file format versions.
Requirements
diff --git a/man1/metastore.1 b/man1/metastore.1
index d788161..1e01daf 100644
--- a/man1/metastore.1
+++ b/man1/metastore.1
@@ -83,6 +83,46 @@ ensure that the stored metadata is interpreted correctly.
.B 0
The original and default format, it is a compact binary representation of the
file metadata stored.
+.TP
+.B 1
+This format is a tab-separated, line-based text representation of the file
+metadata stored. It is not as compact as Format \fB0\fR, but is designed to
+integrate with text-based version control mechanisms, like diffs, patches,
+merges, and conflicts.
+
+After the signature/version header line, the format is:
+
+.EX
+<path> <owner> <group> <mode> <mtime> [<xattr name> <xattr value> ...]
+.EE
+
+Where
+.I owner
+and
+.I group
+are names, not ids,
+.I mode
+is the octal representation of the 16-bit "file type and mode" field described
+in
+.BR inode (7),
+and
+.I mtime
+is the ISO-8601 extended format representation of the last-modified time in UTC,
+with nanosecond precision.
+
+Strings are URL-encoded, and all characters from 0x00 to 0x20 (inclusive), 0x25
+(%) and 0x7F \fBmust\fR be encoded.
+
+As mentioned above, the format is primarily designed to be compatible with
+version control tools. It is secondarily designed to be mostly-readable by
+humans like you, because humans use those tools. It is \fInot\fR specifically
+designed to be written by humans. In the case of merge conflicts that require
+intervention it is recommended that you pick one existing version of an entry,
+rather than trying to edit one of your own with aspects of both. (Or manually
+reset the file permissions, and re-generate the metainfo file.) Particularly,
+because bytes >= 0x80 are not URL-encoded, binary xattr data probably won't be
+UTF-8 clean, so you may have a hard time doing anything other than deleting
+unwanted lines with many editors.
.\"
.SH AUTHORS
metastore was created by David Härdeman in 2007-2008.
diff --git a/metastore.txt b/metastore.txt
index f3d7d55..39fdf5e 100644
--- a/metastore.txt
+++ b/metastore.txt
@@ -87,9 +87,40 @@ FORMATS
0 The original and default format, it is a compact binary repre‐
sentation of the file metadata stored.
+ 1 This format is a tab-separated, line-based text representation
+ of the file metadata stored. It is not as compact as Format 0,
+ but is designed to integrate with text-based version control
+ mechanisms, like diffs, patches, merges, and conflicts.
+
+ After the signature/version header line, the format is:
+
+ <path> <owner> <group> <mode> <mtime> [<xattr name> <xattr value> ...]
+
+ Where owner and group are names, not ids, mode is the octal rep‐
+ resentation of the 16-bit "file type and mode" field described
+ in inode(7), and mtime is the ISO-8601 extended format represen‐
+ tation of the last-modified time in UTC, with nanosecond preci‐
+ sion.
+
+ Strings are URL-encoded, and all characters from 0x00 to 0x20
+ (inclusive), 0x25 (%) and 0x7F must be encoded.
+
+ As mentioned above, the format is primarily designed to be com‐
+ patible with version control tools. It is secondarily designed
+ to be mostly-readable by humans like you, because humans use
+ those tools. It is not specifically designed to be written by
+ humans. In the case of merge conflicts that require intervention
+ it is recommended that you pick one existing version of an en‐
+ try, rather than trying to edit one of your own with aspects of
+ both. (Or manually reset the file permissions, and re-generate
+ the metainfo file.) Particularly, because bytes >= 0x80 are not
+ URL-encoded, binary xattr data probably won't be UTF-8 clean, so
+ you may have a hard time doing anything other than deleting un‐
+ wanted lines with many editors.
+
AUTHORS
- metastore was created by David Härdeman in 2007-2008. Now it is main‐
- tained by Przemysław Pawełczyk. All source code contributors are
+ metastore was created by David Härdeman in 2007-2008. Now it is main‐
+ tained by Przemysław Pawełczyk. All source code contributors are
listed in the AUTHORS file.
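
To make the Format 1 entry line described in the man page more concrete, here is a small Python sketch of a reader for one entry. It is only an illustration under stated assumptions, not metastore's parser: `Entry`, `parse_mtime` and `parse_entry` are hypothetical names, the sample values in the usage note are invented, and `urllib.parse.unquote_to_bytes` and `int(..., 8)` are more lenient than a strict reader would be (the former leaves malformed `%` escapes untouched). The mtime is kept as separate seconds and nanoseconds because Python's `datetime` only has microsecond resolution.

```python
from datetime import datetime, timezone
from typing import Dict, NamedTuple, Tuple
from urllib.parse import unquote_to_bytes


class Entry(NamedTuple):
    path: bytes
    owner: bytes
    group: bytes
    mode: int          # file type and mode bits, parsed from octal
    mtime_sec: int     # seconds since the Unix epoch (UTC)
    mtime_nsec: int    # nanosecond part of the mtime
    xattrs: Dict[bytes, bytes]


def parse_mtime(text: str) -> Tuple[int, int]:
    """Split "YYYY-mm-ddTHH:MM:SS.nnnnnnnnnZ" into (seconds, nanoseconds)."""
    if not text.endswith("Z") or "." not in text:
        raise ValueError("mtime must be UTC with a fractional part: %r" % text)
    whole, frac = text[:-1].split(".", 1)
    if len(frac) != 9 or not frac.isdigit():
        raise ValueError("expected 9 nanosecond digits: %r" % text)
    moment = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    return int(moment.replace(tzinfo=timezone.utc).timestamp()), int(frac)


def parse_entry(line: bytes) -> Entry:
    """Parse one tab-separated entry line (trailing newline optional)."""
    fields = line.rstrip(b"\n").split(b"\t")
    # Five fixed fields, then zero or more xattr name/value pairs.
    if len(fields) < 5 or (len(fields) - 5) % 2 != 0:
        raise ValueError("unexpected field count: %d" % len(fields))
    path, owner, group = (unquote_to_bytes(f) for f in fields[:3])
    mode = int(fields[3], 8)
    sec, nsec = parse_mtime(unquote_to_bytes(fields[4]).decode("ascii"))
    names = (unquote_to_bytes(f) for f in fields[5::2])
    values = (unquote_to_bytes(f) for f in fields[6::2])
    return Entry(path, owner, group, mode, sec, nsec, dict(zip(names, values)))
```

For example, `parse_entry(b"./README\troot\tstaff\t100644\t2022-05-17T09:44:27.123456789Z\tuser.note\thello%20world\n")` yields a path of `b"./README"`, a mode of `0o100644`, and a single xattr `user.note` whose value is `b"hello world"`. Working on bytes rather than `str` sidesteps the UTF-8 cleanliness issue noted above, since bytes >= 0x80 are passed through unchanged.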