Version 1
---------

The following sections explain the internals of the metastore file (.metadata),
version 1.


### Data types

    SIGNATURE   = Magic signature, 10 bytes long = "MeTaSt00r3"
    VERSION     = Format version string, 8 bytes long. This version = "00000001"
    URLSTRING   = URL-encoded string. Chars 0x00 to 0x20 (inclusive), 0x7F, and
                  0x25 (%) *must* be encoded. Terminated by the first character
                  from that set which appears unencoded (e.g. "\t", "\n").
    INTSTRING   = ASCII-encoded integer, in a pre-specified base from 2 (binary)
                  to 16 (hexadecimal). May be preceded by "-" for negative values
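
As an illustration of the URLSTRING rule above, here is a minimal Python sketch of an encoder and decoder. The names `url_encode`, `url_decode`, `MUST_ENCODE` and `HEX_DIGITS` are hypothetical, and emitting uppercase hex digits is an assumption, since this document does not specify the case. It is not taken from the metastore sources.

```python
# Bytes the spec says *must* be encoded: 0x00-0x20 inclusive, 0x7F, and 0x25 (%).
MUST_ENCODE = frozenset(range(0x21)) | {0x7F, 0x25}
HEX_DIGITS = b"0123456789ABCDEFabcdef"


def url_encode(raw: bytes) -> bytes:
    """Encode only the bytes that must be encoded; pass everything else through."""
    out = bytearray()
    for byte in raw:
        if byte in MUST_ENCODE:
            out += b"%%%02X" % byte  # uppercase hex is an assumption
        else:
            out.append(byte)
    return bytes(out)


def url_decode(encoded: bytes) -> bytes:
    """Reverse url_encode(); accepts upper- or lowercase hex digits."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        byte = encoded[i]
        if byte == 0x25:  # "%" introduces a two-digit hex escape
            digits = encoded[i + 1:i + 3]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(f"bad escape at offset {i}")
            out.append(int(digits, 16))
            i += 3
        elif byte in MUST_ENCODE:
            # A raw tab or newline here would be a field or entry terminator.
            raise ValueError(f"unencoded byte 0x{byte:02X} at offset {i}")
        else:
            out.append(byte)
            i += 1
    return bytes(out)
```

For example, `url_encode(b"my file\n")` gives `b"my%20file%0A"`, and bytes >= 0x80 pass through untouched.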

### File layout

    SIGNATURE VERSION "\n"
    n * (ENTRY "\n")
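
A reader might check the header line roughly like this. This is only a sketch: `check_header` is a hypothetical helper, not a function from the metastore sources, and it assumes the whole file has already been read into memory as bytes.

```python
SIGNATURE = b"MeTaSt00r3"  # 10 bytes
VERSION = b"00000001"      # 8 bytes, this format version


def check_header(data: bytes) -> bytes:
    """Validate the header line and return the bytes that follow it."""
    if not data.startswith(SIGNATURE):
        raise ValueError("not a metastore file (bad signature)")
    version = data[len(SIGNATURE):len(SIGNATURE) + len(VERSION)]
    if version != VERSION:
        raise ValueError(f"unsupported format version {version!r}")
    header_len = len(SIGNATURE) + len(VERSION)
    if data[header_len:header_len + 1] != b"\n":
        raise ValueError("header line is not terminated by a newline")
    return data[header_len + 1:]
```

Checking the signature and the version separately lets a caller tell "not a metastore file at all" apart from "a metastore file in a version this code does not understand".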


### ENTRY format

    URLSTRING           - Path  (absolute or relative)
    "\t" URLSTRING      - Owner (owner name, not uid)
    "\t" URLSTRING      - Group (group name, not gid)
    "\t" INTSTRING      - Mode  (base 8, of struct stat.st_mode & 0177777,
                                i.e. file type and mode, as per inode(7))
    "\t" URLSTRING      - Mtime (including nanoseconds) in ISO-8601 format, UTC.
                                "YYYY-mm-ddTHH:MM:SS.nnnnnnnnnZ"

    m * ("\t" URLSTRING "\t" URLSTRING)
                        - xattr name/value pairs. `m` may be 0.

    "\n"                - Entry-terminating newline.
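
To make the field order concrete, here is a rough Python sketch that splits one entry line (passed in without its terminating newline) into its fields. `Entry` and `parse_entry` are hypothetical names. The sketch uses the standard library's `urllib.parse.unquote_to_bytes` as a lenient stand-in for a strict URLSTRING decoder, and it assumes a four-digit year in the mtime field.

```python
import calendar
import re
import time
from typing import List, NamedTuple, Tuple
from urllib.parse import unquote_to_bytes

MTIME_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{9}Z")


class Entry(NamedTuple):
    path: bytes
    owner: bytes
    group: bytes
    mode: int        # st_mode & 0177777
    mtime_sec: int   # seconds since the Unix epoch, UTC
    mtime_nsec: int  # nanosecond part, 0..999999999
    xattrs: List[Tuple[bytes, bytes]]


def parse_entry(line: bytes) -> Entry:
    """Parse one tab-separated entry line (no trailing newline)."""
    fields = line.split(b"\t")
    if len(fields) < 5 or (len(fields) - 5) % 2 != 0:
        raise ValueError("wrong number of fields in entry")
    path, owner, group = (unquote_to_bytes(f) for f in fields[:3])
    mode = int(fields[3], 8)  # INTSTRING, base 8
    mtime = unquote_to_bytes(fields[4]).decode("ascii")
    if not MTIME_RE.fullmatch(mtime):
        raise ValueError(f"bad mtime {mtime!r}")
    # Whole seconds via time.strptime/calendar.timegm (both treat it as UTC);
    # the nine fractional digits are kept separately to avoid float rounding.
    secs = calendar.timegm(time.strptime(mtime[:19], "%Y-%m-%dT%H:%M:%S"))
    nsec = int(mtime[20:29])
    rest = [unquote_to_bytes(f) for f in fields[5:]]
    xattrs = list(zip(rest[0::2], rest[1::2]))  # name/value pairs
    return Entry(path, owner, group, mode, secs, nsec, xattrs)
```

Splitting on raw tabs is safe because a literal tab inside any field must be URL-encoded as `%09`.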


### Discussion

This format is designed to work with version control systems, specifically
`git(1)`.

To fit in with `git` and its related tooling, this format is a line-based text
file. Each record is a bunch of text fields separated by tabs, terminated by a
newline. This means records should be identifiable and somewhat understandable
to readers, and should work with `diff(1)` and `patch(1)` (and their `git`ified
descendants). Merge conflicts should produce files that are resolvable with any
ordinary text editor. (Even `ed(1)`, if you insist!)

This format is generally slightly larger than Format 0, but shouldn't be
significantly so for most use cases, and this is a reasonable trade-off for
readability and diff/patch/merge-ability.

The format could be significantly larger if files have large amounts of binary
data in xattrs, as 35 out of the 256 possible bytes require URL-encoding as a
3-byte sequence, giving a 27.3% increase (by my calculations). This clearly
isn't ideal, but this author suspects that the proportion of files with large
binary xattrs is fairly small, and this should not cause an issue in practice.
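
The 27.3% figure can be reproduced with a few lines of Python. Note the assumption baked into it: every byte value is equally likely, as in random or compressed binary data. Real xattr contents may do better or worse.

```python
# Bytes that must be URL-encoded: 0x00-0x20 inclusive, 0x7F, and 0x25 (%).
must_encode = set(range(0x00, 0x21)) | {0x7F, 0x25}
n_encoded = len(must_encode)  # 33 + 2 = 35

# One of each byte value: encoded bytes take 3 bytes, the others take 1.
encoded_size = (256 - n_encoded) * 1 + n_encoded * 3  # 221 + 105 = 326
overhead = encoded_size / 256 - 1                     # 70 / 256

print(f"{n_encoded} bytes need encoding, overhead {overhead:.1%}")
```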

If a user does have large amounts of binary xattr data but can't handle the 27%
size increase this format incurs, they can still use Format 0 to store it
instead. If *you* have large amounts of binary xattr data that you have to store
in git in a way that's diff/patch/merge-able - well, feel free to submit patches
for Format 2 yourself ;-)

If you do update this format, remember to change the man page as well as this
document! I've tried to keep the info in the man page as short as possible, and
to only include what a user should need to work with the resulting files.
Extended musings and notes for implementors go here (or in the `git commit`
log :-)


### UTF-8 cleanliness

Note that because bytes >= 0x80 are not required to be URL-encoded, binary xattr
data is very unlikely to be UTF-8 clean. If this is a problem for the editor
you use to resolve conflicts... I dunno. Get a better editor maybe? We could
URL-encode all high bytes, but that would triple the size of half the bytes in
binary data, and of all non-ASCII byte sequences in UTF-8 text. I suppose it
might be possible to URL-encode all sequences of high bytes that are *not* UTF-8
clean (and that would be backwards-compatible with the existing format) but I
don't want to add that much complexity at the moment. Also, it might not be
"enough" as you'd probably want to encode non-printing Unicode formatting
characters such as LTR/RTL marks (U+200E/U+200F) to prevent the possibility of
"Trojan Source" type attacks.

(See <https://lwn.net/Articles/874951/> for more info on "Trojan Source")
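
If you want to know whether a decoded value would upset a UTF-8-only editor, or whether it hides invisible formatting characters, a check along these lines might help. This is a sketch only: both function names are hypothetical, and flagging every character in Unicode category `Cf` (which covers U+200E and U+200F among others) is a deliberately broad assumption, not something this format requires.

```python
import unicodedata


def is_utf8_clean(value: bytes) -> bool:
    """True if the bytes decode as valid UTF-8."""
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def has_format_chars(value: bytes) -> bool:
    """True if the valid UTF-8 parts contain any Unicode format (Cf) character."""
    text = value.decode("utf-8", errors="ignore")
    return any(unicodedata.category(ch) == "Cf" for ch in text)
```

A tool could run these over xattr values before a merge and warn the user, without changing what is written to the file.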