Artisanal handcrafted Git repositories

drew.silcock.dev

242 points by drewsberry a day ago

bradfitz 16 hours ago

My recent horror from some git work was discovering how git sorts its tree objects.

The docs just say to sort by C locale (byte-order sorting). Easy. Except git was sometimes rejecting my packfiles as being bogus per its fsck code, saying my trees were misordered.

TURNS OUT THERE'S AN UNDOCUMENTED RULE: you need to append an implicit forward slash to directory tree entry names before you sort them.

That forward slash is not encoded in the tree object, nor is the type of the entry. You just put the 20-byte SHA-1 hash, which points to either a blob or a tree (or a commit for submodules).

So you can have one directory with directory "testing" and file "testing.md" and it'll sort differently than a directory with two files "testing" and "testing.md".

You can see a repro at https://gist.github.com/bradfitz/4751c58b07b57ff303cbfec3e39...

(So to verify whether a tree object is formatted correctly, you need to have the blobs of all the entries in the tree, at least one level)
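
A minimal Python sketch of the ordering rule described above, assuming the caller already knows which entries are trees. The `is_tree` flag and both function names are made up for illustration and are not Git's API.

```python
def tree_sort_key(name: bytes, is_tree: bool) -> bytes:
    """Subtree names compare as if they ended in '/'; other names compare as-is."""
    return name + b"/" if is_tree else name


def sort_tree_entries(entries):
    """Sort (name, is_tree) pairs bytewise using the implicit-slash key."""
    return sorted(entries, key=lambda e: tree_sort_key(e[0], e[1]))


# "testing" as a directory sorts after "testing.md": b"/" (0x2f) > b"." (0x2e).
with_dir = sort_tree_entries([(b"testing", True), (b"testing.md", False)])
# "testing" as a plain file sorts first: it is a prefix of "testing.md".
with_file = sort_tree_entries([(b"testing", False), (b"testing.md", False)])

print([n.decode() for n, _ in with_dir])   # ['testing.md', 'testing']
print([n.decode() for n, _ in with_file])  # ['testing', 'testing.md']
```

The same two names come out in a different order depending only on whether `testing` is a directory, which is the mismatch that fsck complained about.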

xqb64 10 hours ago

I've had this exact bug happen to me when I implemented my git clone.
The way I found out was that GitHub kept rejecting my push because, as I later discovered, my git history was invalid: entries were sorted improperly due to the forward slash requirement. I could have solved this with the real git, but the point was to use my tool exclusively for version control from inception, so I just deleted the .git folder. So, my git history appears to begin near the end of the whole cycle. But I did manage to learn a lot, both about git and about the language I implemented it in.
Elucalidavah 10 hours ago

> directory tree entry names
But... git doesn't really store directories, does it?
- kaoD 10 hours ago
  
  I wrote a longer comment saying this (deleted now since I was wrong).
  Turns out that Git does somewhat store dirs (in form of trees). See https://git-scm.com/book/en/v2/Git-Internals-Git-Objects (section "Tree Objects").
  To understand op's repro look at the last two lines (objects in the tree) in each of their command outputs, not the files shown in the first few lines.
  What I think op means is that the `testing` tree pointed to in their first example is sorted after `testing.md`, even though it's only called `testing`, because it's being sorted as `testing/` and `/` is > `.` bytewise.
  I'm not at a computer right now but it would be nice to test it with files named `testing.` and `testing0` since they are adjacent bytewise and would show the implicit forward slash more clearly with the tree object sitting between them.
  This makes me wonder why Git can't just store an empty tree for empty dirs.
  EDIT: did the Gist https://gist.github.com/alvaro-cuesta/bd0234e3e1a66819c7e9e9...
  Notice the `git cat-file -p HEAD^{tree}` outputs.
  - lucasoshiro 2 hours ago
    
    > This makes me wonder why Git can't just store an empty tree for empty dirs.
    tl;dr: it can (see my other comment) and the empty tree is hardcoded. But since the index works with file paths and blobs, having no file means that there's no entry in the index
- remram 9 hours ago
  
  Yes it does, it just doesn't store empty directories.
  - lucasoshiro 2 hours ago
    
    It can store empty directories (actually, trees). It normally doesn't because the index maps paths to blobs: an empty directory doesn't have a file to map to a blob, so `git add` has no effect. Given that we normally write commits from the index content, we normally won't find an empty tree.
    You can run `git commit --allow-empty` with an empty index and the root tree will be the empty tree:
        $ git init
        $ git commit --allow-empty -m foo
        $ git rev-parse @^{tree}
        4b825dc642cb6eb9a060e54bf8d69288fbee4904
    4b825dc is the empty tree. And a funny thing about it is that it is hardcoded in Git, and you can use it without having this object:
        $ git init
        $ git commit-tree -m foo 4b825dc642cb6eb9a060e54bf8d69288fbee4904
        $ tree .git/objects  # you'll see that there's no file for the empty tree
    This is a good read about that weird object: https://matheustavares.dev/posts/empty-tree
  - juped 3 hours ago
    
    You can perfectly easily put the empty tree object in as a tree object's child; this just isn't supported, and some parts of Git will break.
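
The empty-tree ID quoted in this subthread can be reproduced by hand. This is a small Python sketch, assuming a SHA-1 repository. The helper name `git_object_id` is made up for illustration.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash '<type> <size>' + NUL + body, as SHA-1 Git repositories do for object IDs."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree with no entries has an empty body, so only the header is hashed.
print(git_object_id("tree", b""))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Since the ID depends only on the bytes `tree 0\0`, every repository agrees on it, which fits with Git being able to treat it as a known constant.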

lucasoshiro 21 hours ago

Something that I really like in Git is how its data structures are easy to understand and how transparent it is. It's possible to write your own "Git" compatible with existing Git directories just by reading how it works under the hood.

shivasaxena 20 hours ago

I agree, but only in theory.
Projects like gitoxide have been in development for years now.
- fiddlerwoaroof 19 hours ago
  
  I wrote a nearly complete implementation of git file format parsers in Common Lisp over like a month of evenings and weekends. I’m sure there are hard parts between where I am and a full git implementation but you can get quite a bit of utility out of a relatively small amount of effort.
  - MrJohz 14 hours ago
    
    It's a case of Pareto. Parsing the git file format is relatively simple, but handling all the weird states a Git repo can be in and doing the correct things to those files in each state is a lot harder. And then adding the network protocol on top of that makes directly reproducing Git quite difficult.
    I know JJ used to use Git2 for a lot of network operations like pushing and pulling, but they ran into so many issues with SSH handling that they've since switched to directly invoking the Git binary for those operations.
    
    fiddlerwoaroof 14 hours ago
    
    There aren’t that many weird states a git repository can be in: the on-disk format of the repository is too simple for that. The hard part has to do with the various protocols for transferring objects around.
    
    deathanatos 12 hours ago
    
    I think there's more corners out there than most people would give credit to? Just off the top of my head: files in the index (but maybe this isn't "weird enough"), rebasing but paused, rebasing with conflicts, merge with conflicts, cherry-picking but conflicts, middle of a bisect with all the state that implies, alternate objects dirs, alternate working dirs, submodules and all of their weirdness, and a "bare" repo.
    Heck, had my PS1 return an error this week after I created a separate working dir for a repo and cd'd into it. Did you know .git can be a normal file? I didn't when I wrote my PS1.
    
    fiddlerwoaroof an hour ago
    
    I knew .git can be a normal file because of worktrees. But most of the weird states have to do with the working tree not the repository. Even rebasing isn’t weird as far as the file formats go: it just is replaying commits on top of a new base commit. Since my goal was basically to implement enough of git to serve files from a git repository as a website, the actual task was fairly small.
  - lucasoshiro 17 hours ago
    
    Yeah, I wrote mine in Haskell. It's a good exercise for understanding how Git works
- chubot 17 hours ago
  
  Not sure what gitoxide is, but libgit2 already exists, and it seems to be an independent implementation - https://github.com/libgit2/libgit2
  I think GitHub and most big Git hosts use it
  - steveklabnik 15 hours ago
    
    libgit2 has a ton of compatibility issues, especially around authentication, that make it only useful in some circumstances.
    (gitoxide is a similar project but in Rust, it's not ready for the big time either, though it keeps on getting better!)
  - 3eb7988a1663 15 hours ago
    
    Jujutsu threw in the towel and is shelling out to the git CLI because of minor variations in libgit2 vs the binary.
    Failing to find a write-up, but there was this Lobsters thread[0] where someone from GitLab reported they had to do the same owing to some discrepancies vs the binary, where all of the real development happens.
    [0] https://lobste.rs/s/vmdggh/jujutsu_v0_30_0_released
    
    Dylan16807 13 hours ago
    
    But nothing in that description of problems is tied to the repository format.
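
On the point raised above that `.git` can be a normal file: in that case the file holds a single `gitdir: <path>` line pointing at the real repository directory. Here is a hedged Python sketch of handling both cases. The function name `resolve_git_dir` is hypothetical, and unlike real Git it does not search parent directories.

```python
from pathlib import Path


def resolve_git_dir(worktree: Path) -> Path:
    """Return the repository directory for a working tree whose .git is a dir or a file."""
    dot_git = worktree / ".git"
    if dot_git.is_dir():
        return dot_git
    # Otherwise expect a one-line file of the form "gitdir: <path>".
    text = dot_git.read_text().strip()
    prefix = "gitdir: "
    if not text.startswith(prefix):
        raise ValueError(f"{dot_git} is not a gitdir file")
    target = Path(text[len(prefix):])
    # A relative path is taken relative to the directory holding the .git file.
    return target if target.is_absolute() else (worktree / target).resolve()
```

A shell prompt script that only tests for a `.git` directory will miss the second branch, which matches the PS1 breakage described above.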

veganjay 19 hours ago

Neat to see this done by hand! It helps demystify the magic behind git commands.

If you like this, I also recommend "Write Yourself a Git", where you build a minimal git implementation using Python: https://wyag.thb.lt/

xqb64 11 hours ago

There is also James Coglan's "Building Git" book that I just went through and can vouch for its quality.
bhasi 17 hours ago

A similar project is CodeCrafters' Build Your Own Git: https://app.codecrafters.io/courses/git/overview
wonderwonder 18 hours ago

How cool, thank you

sc68cal 21 hours ago

To the site author: I'm on a MBP M1 Mac and honestly I can't really read the text. Far too small, and increasing the zoom just makes the text large but the margins less wide. Firefox reader mode also renders really badly.

Please, consider making the layout better for us old coders whose eyes are going, or for hi res displays

retsibsi 2 hours ago

For me, the text size would be fine if the contrast were better. The background colour is similar to the colour of the non-central pixels of the text, and even the central pixels are grey rather than black.
derefr 19 hours ago

FYI: the pinch-to-zoom gesture from mobile browsers (from before websites were mobile-responsive) has also long been implemented for all modern desktop browsers. It's viewport zoom, which is far better than the font-scaling zoom you get by pressing Cmd-+, and makes this site easily readable.
(The much-less-well-known mobile double-tap-on-text gesture [it zooms-to-fit whatever element you tapped on to the width of the viewport] was also ported to desktop browsers. Though, on desktop with a touchpad, it's a two-finger double-tap — which I don't think anyone would ever even think to try.)
- LocalPCGuy 3 hours ago
  
  FWIW, most browsers by default now do a viewport zoom with Ctrl/Cmd-+ rather than a font-scaling zoom. I think browsers generally have the option to change that, so if you prefer the former but it's doing the latter, you may want to check the browser settings.
- BobaFloutist 18 hours ago
  
  Double tap on text highlights it for me. Is that an iPhone/Android thing or what?
  - derefr 17 hours ago
    
    As I said, it's a two finger double-tap.
    But also, under further investigation — and unlike with pinch-to-zoom — desktop support for the two-finger double-tap gesture seems to be specific to macOS. (Which is weird, because Chrome has support for arbitrary multitouch gesture processing to enable the JS multitouch API. So you'd think Chrome's support for "the multitouch gestures the OS expects" would be built on top of that generic multitouch recognizer [and therefore working everywhere that recognizer works], instead of expecting the OS to pre-recognize specific gestures and translate them to specific OS input events.)
    
    BobaFloutist 16 hours ago
    
    I was trying on my phone, but my laptop seems to interpret it as a right click. Which, frankly, makes sense.
  - antonvs 12 hours ago
    
    On my iPad in Safari and Pixel Android phone in Firefox, one-finger double tap on text does the fit to viewport.
    On my Ubuntu laptop in Chrome, I couldn’t find a way to make it work - even tapping the touchscreen didn’t work. But I’m not using the stock Ubuntu GUI, so it could be that (LXQt+XMonad).
    
    BobaFloutist 25 minutes ago
    
    >Pixel Android phone in Firefox, one-finger double tap on text does the fit to viewport.
    I'm very confused, can you clarify what makes this different from the gesture that highlights text?
    Edit: it appears that "request desktop site" makes it fit the viewport, whereas using the mobile view it's I guess already fitting the viewport so it highlights the text. The strange thing is in the desktop view, if I pinch zoom after fitting the viewport and do it again, it zooms out, whereas the mobile view still highlights the text. Which kinda makes sense, since mobile view it's fairly likely that you zoomed in to highlight the text more accurately, though it's weird that it's so inconsistent.
sam_lowry_ 21 hours ago

Works great on Firefox for Android though )
- lucasoshiro 21 hours ago
  
  Also works great on Safari on a M1 MacBook Air, here

jllyhill 8 hours ago

Am I the only one having troubles with the site on mobile? I'm using Firefox on a decent Android phone but the scroll is extremely stuttery and it distracts from the article unfortunately.

styanax 7 hours ago
The site is built with a content creation tool that uses a lot of JS and CSS, but the CSS in its automated output is atrocious, so the browser has to interpret a mess of directives in every code block. The tool is generating HTML trash like (brackets replaced so the comment isn't parsed as HTML):
```
    [span style="--0:#E1E4E8;--1:#24292E"] [/span]
```
...over and over, essentially giving style directives for every blank space in the code block. A less capable mobile CPU may well have issues rendering this site due to the presence of so much trash CSS inside its guts. $0.02 hth

lemming 12 hours ago

> Git refers to the user-friendly commands as “porcelain”

Ahhhhahahaha… “user friendly”. When compared to coding the repo by hand, I guess.

aGHz 7 hours ago

When compared to the "plumbing" commands. If you want to know more about git's plumbing vs porcelain metaphor, this is a good quick overview: https://stackoverflow.com/a/39848551
antonvs 12 hours ago

This is what happens when you let an OS kernel guy write a cli.

mitchitized 5 hours ago

I closed the tab as soon as I saw `ignorecase = true`.

Absolutely NOT going there again.

* points at numerous scars and trauma

HexDecOctBin 17 hours ago

Okay, there's something I have been thinking about recently. Is it possible to somehow make Git use the Content Defined Chunking algorithm from rsync? Maybe somehow using clean/smudge? If not git, then maybe Mercurial, Fossil or any other DVCS?

This would help with large binary assets without having to deal with the mess that is LFS, as long as the assets were uncompressed.

hanwenn 6 hours ago

IIRC it already uses content defined chunking for finding object deltas.

aeblyve 19 hours ago

I thought this was going to be a sardonic article about doing programming without LLMs.

lioeters 18 hours ago

I'm starting to see this kind of wording as a unique selling point, that some software (or article, visual art, etc.) is handcrafted and artisanal, as opposed to AI-generated. "Every word was written by me, a human being!" At this point in the emerging technology I can usually tell the difference intuitively, but it's possible that one day it will be indistinguishable - and the quality of "handmade" will be simply a matter of branding for niche enthusiasts, like vinyl records.
- lan321 6 hours ago
  
  Homegrown bugs from sustainably raised Bio-certified devs vs industrial bugs.

BobbyTables2 19 hours ago

I realize the concept is very similar but would love to see a writeup on how Docker stores images using OverlayFS. (Has quite a bit of metadata!)

kassah 20 hours ago

The simplicity of Git is awesome. Great article! I had looked at what it would take to find a single file in a remote git repo. I decided against speaking the git protocol directly and just checked out the entire repo to get a single file. Reading through this makes me think I may have given up too easily.

I asked a few git hosting providers, and they all said they had private APIs developed internally for the purpose.

iJohnDoe 13 hours ago

What is this web site theme or CMS?

deadbabe 21 hours ago

[flagged]

ChrisMarshallNY 21 hours ago

My understanding is that Mercurial is sort of Beta to Git's VHS. There are some definite advantages, but it's losing support.
- GuB-42 18 hours ago
  
  I am sure that it is because the porn industry settled on Git :)
  Anyways, I started on Mercurial, and I think it has a better UX, but technically I now prefer Git. The success of Git over Mercurial surprised me a little because of that: Git is not an easy version control system to get into, at least when compared to Mercurial, which shouldn't help adoption, but I guess it is just because some big names decided on Git.
  Mercurial and Git use the same fundamental principles, and one is not really better than the other, just details.
zanecodes 20 hours ago

I thought all the cool kids were on Pijul, or was it Darcs? Maybe it was Fossil? No wait, it was definitely Jujutsu.
- jact 18 hours ago
  
  Can confirm that cool kids are definitely using Fossil

gerdesj 20 hours ago

This is all very well but how does Linus Torvalds use git? Given he invented the bloody thing, it might be nice to see how the Boss uses it!

git was created to scratch an itch (actually a bit of a roiling boil, that needed a serious amount of soothing ointment and as it turns out: a compiler, some source code and quite a lot of effort). ... anyway the history of it is well documented.

FFS: git was called git because a Finnish bloke with English as a second, but well used, tongue had learned what a "git" is and it seemed appropriate. Bear in mind that Mr T was deeply in his shouty phase at that point in time.

Artisanal git sounds all kinds of wrong 8) It's just a tool to do a job and I suggest you use it in the same way as the XKCD comic mandates (that is the official manual, despite what you might think).

The Conclusion is spot on - great article.

lysace 21 hours ago

I would have called this: "Futzing around with internal git data structures".

DrBazza 9 hours ago

I'm glad I clicked through to the actual article rather than dismissing it via its slightly silly title. I learnt a few things about git, and I didn't realize that the tool `pigz` existed. Today I learnt...