Version 1
---------

The following sections explain the internals of the metastore file (.metadata),
version 1.


### Data types

    SIGNATURE   = Magic signature, 10 bytes long = "MeTaSt00r3"
    VERSION     = Format version string, 8 bytes long. This version = "00000001"
    URLSTRING   = URL-encoded string. Chars 0x00 to 0x20 (inclusive), 0x7F, and
                  0x25 (%) *must* be encoded. Terminated by the first character
                  that must be encoded but appears unencoded (e.g. "\t", "\n").
    INTSTRING   = ASCII-encoded integer, in a pre-specified base from 2 (binary)
                  to 16 (hexadecimal). May be preceded by "-" for negative values
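
As an illustration of the URLSTRING rule above, here is a minimal Python sketch of an encoder and a strict decoder. The function names are hypothetical, and emitting uppercase hex digits is an assumption: this document does not say which case the real implementation writes.

```python
# Bytes that must be encoded per the rule above: 0x00-0x20, 0x7F and 0x25 ('%').
MUST_ENCODE = frozenset(range(0x00, 0x21)) | {0x7F, 0x25}
HEX_DIGITS = b"0123456789abcdefABCDEF"


def urlstring_encode(raw: bytes) -> bytes:
    """Encode raw bytes as a URLSTRING, escaping only the mandatory bytes."""
    out = bytearray()
    for b in raw:
        if b in MUST_ENCODE:
            out += b"%%%02X" % b  # assumption: uppercase hex digits
        else:
            out.append(b)  # bytes >= 0x80 pass through unencoded
    return bytes(out)


def urlstring_decode(encoded: bytes) -> bytes:
    """Decode one URLSTRING field (already split off from its terminator)."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        b = encoded[i]
        if b == 0x25:  # '%' introduces a two-digit hex escape
            digits = encoded[i + 1:i + 3]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("malformed escape at offset %d" % i)
            out.append(int(digits, 16))
            i += 3
        elif b in MUST_ENCODE:
            raise ValueError("unencoded terminator byte inside field")
        else:
            out.append(b)
            i += 1
    return bytes(out)
```

For example, `urlstring_encode(b"a b%\t")` yields `b"a%20b%25%09"`, and decoding that value returns the original bytes.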

### File layout

    SIGNATURE VERSION "\n"
    n * (ENTRY "\n")
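
A minimal Python sketch of checking the header line, assuming the whole file has been read into memory as bytes. The function name `strip_header` is hypothetical; what follows the header is returned untouched so that entry parsing can be handled separately.

```python
SIGNATURE = b"MeTaSt00r3"  # 10-byte magic signature
VERSION = b"00000001"      # 8-byte version string for this format


def strip_header(data: bytes) -> bytes:
    """Validate the SIGNATURE VERSION line and return the bytes after it."""
    header, newline, rest = data.partition(b"\n")
    if not newline:
        raise ValueError("header line is not newline-terminated")
    if header[:len(SIGNATURE)] != SIGNATURE:
        raise ValueError("bad magic signature")
    if header[len(SIGNATURE):] != VERSION:
        raise ValueError("unsupported format version")
    return rest
```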


### ENTRY format

    URLSTRING           - Path  (absolute or relative)
    "\t" URLSTRING      - Owner (owner name, not uid)
    "\t" URLSTRING      - Group (group name, not gid)
    "\t" INTSTRING      - Mode  (base 8, of struct stat.st_mode & 0177777,
                                i.e. file type and mode, as per inode(7))
    "\t" URLSTRING      - Mtime (including nanoseconds) in ISO-8601 format, UTC.
                                "YYYY-mm-ddTHH:MM:SS.nnnnnnnnnZ"
                                Or a literal "0" if mtime is not saved

    m * ("\t" URLSTRING "\t" URLSTRING)
                        - xattr name/value pairs. `m` may be 0.

    "\n"                - Entry-terminating newline.
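
To make the field order concrete, here is a hedged Python sketch that parses a single entry line whose terminating newline has already been removed. The names `parse_entry` and `parse_mtime` are hypothetical, and `urllib.parse.unquote_to_bytes` is used as a lenient stand-in for a real URLSTRING decoder (it does not reject malformed escapes).

```python
import calendar
import time
from urllib.parse import unquote_to_bytes


def parse_mtime(field: bytes):
    """Return (seconds, nanoseconds) since the epoch, or None for a literal "0"."""
    if field == b"0":
        return None
    text = unquote_to_bytes(field).decode("ascii")
    stamp, _, frac = text.rstrip("Z").partition(".")
    seconds = calendar.timegm(time.strptime(stamp, "%Y-%m-%dT%H:%M:%S"))
    # Pad the fraction to nine digits so it is always read as nanoseconds.
    return seconds, int(frac.ljust(9, "0"))


def parse_entry(line: bytes) -> dict:
    """Split one tab-separated entry into its decoded fields."""
    fields = line.split(b"\t")
    # Five fixed fields, then zero or more xattr name/value pairs.
    if len(fields) < 5 or (len(fields) - 5) % 2 != 0:
        raise ValueError("unexpected number of fields: %d" % len(fields))
    path, owner, group, mode, mtime = fields[:5]
    rest = fields[5:]
    return {
        "path": unquote_to_bytes(path),
        "owner": unquote_to_bytes(owner),
        "group": unquote_to_bytes(group),
        "mode": int(mode, 8),  # INTSTRING in base 8
        "mtime": parse_mtime(mtime),
        "xattrs": [
            (unquote_to_bytes(name), unquote_to_bytes(value))
            for name, value in zip(rest[0::2], rest[1::2])
        ],
    }
```

Splitting on tabs before decoding is safe here because a literal tab inside any field must itself be encoded as `%09`.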


### Discussion

This format is designed to work with version control systems, specifically
`git(1)`.

To fit in with `git` and its related tooling, this format is a line-based text
file. Each record is a bunch of text fields separated by tabs, terminated by a
newline. This means records should be identifiable and somewhat understandable
to readers, and should work with `diff(1)` and `patch(1)` (and their `git`ified
descendants). Merge conflicts should produce files that are resolvable with any
ordinary text editor. (Even `ed(1)`, if you insist!)

This format is generally slightly larger than Format 0, but shouldn't be
significantly so for most use cases, and this is a reasonable trade-off for
readability and diff/patch/merge-ability.

The format could be significantly larger if files have large amounts of binary
data in xattrs, as 35 out of the 256 possible bytes require URL-encoding as a
3-byte sequence, giving a 27.3% increase (by my calculations). This clearly
isn't ideal, but this author suspects that the proportion of files with large
binary xattrs is fairly small, and this should not cause an issue in practice.
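
The 27.3% figure can be reproduced with a few lines of Python, assuming every byte value is equally likely in the xattr data (an assumption about the data, not a property of the format):

```python
# Bytes that must be URL-encoded: 0x00-0x20, 0x7F and 0x25 ('%').
must_encode = set(range(0x00, 0x21)) | {0x7F, 0x25}
count = len(must_encode)  # 33 + 1 + 1 = 35

# Encoded size of one copy of every byte value: 1 byte each, or 3 if escaped.
encoded_len = (256 - count) * 1 + count * 3  # 221 + 105 = 326

overhead = encoded_len / 256 - 1  # 70 / 256 = 0.2734375
print("%d bytes must be encoded, overhead %.1f%%" % (count, overhead * 100))
```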

If a user does have large amounts of binary xattr data but can't handle the 27%
size increase this format incurs, they can still use Format 0 to store it
instead. If *you* have large amounts of binary xattr data that you have to store
in git in a way that's diff/patch/merge-able - well, feel free to submit patches
for Format 2 yourself ;-)

If you do update this format, remember to change the man page as well as this
document! I've tried to keep the info in the man page as short as possible, and
to only include what a user should need to work with the resulting files.
Extended musings and notes for implementors go here (or in the `git commit`
log :-)


### UTF-8 cleanliness

Note that because bytes >= 0x80 are not required to be URL-encoded, binary xattr
data is very unlikely to be UTF-8 clean. If this is a problem for the editor
you use to resolve conflicts... I dunno. Get a better editor maybe? We could
URL-encode all high bytes, but that would triple the size of half the bytes in
binary data, and of all non-ASCII byte sequences in UTF-8 text. I suppose it
might be possible to URL-encode all sequences of high bytes that are *not* UTF-8
clean (and that would be backwards-compatible with the existing format) but I
don't want to add that much complexity at the moment. Also, it might not be
"enough" as you'd probably want to encode non-printable UTF-8 control codes
such as LTR/RTL marks (U+200E/U+200F) to prevent the possibility of "Trojan
Source" type attacks.

(See <https://lwn.net/Articles/874951/> for more info on "Trojan Source")


### Sorting

To generate stable metadata files that do not depend on the order in which
files are returned by `readdir()`, which would produce spurious diffs, sort
entries by path ASCIIbetically, as with `strcmp(3)`. But we don't require
metadata files to be sorted when reading them.
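
A small Python sketch of this ordering, under the assumption that the comparison is made on the decoded path bytes rather than on the URL-encoded text. The helper name `sort_entry_lines` is hypothetical. Python compares `bytes` values byte by byte as unsigned integers, which matches `strcmp(3)` for paths (they cannot contain NUL).

```python
from urllib.parse import unquote_to_bytes


def sort_entry_lines(lines):
    """Sort entry lines by their decoded path, bytewise like strcmp(3)."""
    def decoded_path(line: bytes) -> bytes:
        # The path is the first tab-separated field of an entry.
        return unquote_to_bytes(line.split(b"\t", 1)[0])
    return sorted(lines, key=decoded_path)


entries = [
    b"a!b\troot\troot\t100644\t0",
    b"a%20b\troot\troot\t100644\t0",
]
# Decoded, "a b" (0x20) sorts before "a!b" (0x21).
by_path = sort_entry_lines(entries)
# Sorting the raw text compares '%' (0x25) with '!' (0x21) and flips the order.
by_raw_text = sorted(entries)
```

Here `by_path` puts the `a%20b` entry first, while `by_raw_text` puts `a!b` first, which is one way encoded characters change the result of a plain textual sort.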

If we read two entries for the same path, the results are currently unspecified.

Users should probably avoid sorting Format 1 files with standard tools like
`sort(1)`, as the header line (SIGNATURE VERSION) must always be first, and
also URL-encoded characters will throw off the sort order. Also, tools like
`sort` will typically sort according to the current locale, e.g. using
`strcoll(3)` rather than `strcmp()`.