> The more information you have in the file that's not universally applicable to the tasks you have it working on, the more likely it is that Claude will ignore your instructions in the file
Claude.md files can get pretty long, and many times Claude Code just stops following a lot of the directions specified in the file
A friend of mine tells Claude to always address him as “Mr Tinkleberry”. He says he can tell Claude is not paying attention to the instructions in Claude.md when it stops calling him “Mr Tinkleberry” consistently
What I’m surprised about is that OP didn’t mention having multiple CLAUDE.md files in each directory, specifically describing the current context / files in there. Eg if you have some database layer and want to document some critical things about that, put it in “src/persistence/CLAUDE.md” instead of the main one.
Claude pulls in those files automatically whenever it tries to read a file in that directory.
I find that to be a very effective technique to leverage CLAUDE.md files and be able to put a lot of content in them, but still keep them focused and avoid context bloat.
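For illustration, a layout like this keeps each file scoped (the names here are just hypothetical):

    CLAUDE.md                    # project-wide conventions only
    src/persistence/CLAUDE.md    # schema, migration, and query rules for the DB layer
    tests/CLAUDE.md              # how to run and structure tests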
READMEs are written for people, CLAUDE.mds are written for coding assistants. I don’t write “CRITICAL (PRIORITY 0):” in READMEs.
The benefit of CLAUDE.md files is that they’re pulled in automatically, eg if Claude wants to read “tests/foo_test.py” it will automatically pull in “tests/CLAUDE.md” (if it exists).
If AI is supposed to deliver on this magical no-lift ease of use task flexibility that everyone likes to talk about I think it should be able to work with a README instead of clogging up ALL of my directories with yet another fucking config file.
Also this isn’t portable to other potential AI tools. Do I need 3+ md files in every directory?
Don’t worry, as of about 6 weeks ago when they changed the system prompt, Claude will make sure every folder has way more than 3 .md files, seeing as it often writes 2 or more per task, so if you don’t clean them up…
It’s not delivering on magical stuff. Getting real productivity improvements out of this requires engineering and planning and it needs to be approached as such.
One of the big mistakes I think is that all these tools are over-promising on the “magic” part of it.
It’s not. You need to really learn how to use all these tools effectively. This is not done in days or weeks even; it takes months, in the same way becoming proficient in Emacs or vim or a programming language does.
Once you’ve done that, though, it can absolutely enhance productivity. Not 10x, but definitely in the area of 2x. Especially for projects / domains you’re uncomfortable with.
And of course the most important thing is that you need to enjoy all this stuff as well, which I happen to do. I can totally understand the resistance as it’s a shitload of stuff you need to learn, and it may not even be relevant anymore next year.
While I believe you're probably right that getting any productivity gains from these tools requires an investment, I think calling the process "engineering" is really stretching the meaning of the word. It's really closer to ritual magic than any solid engineering practices at this point. People have guesses and practices that may or may not actually work for them (since measuring productivity increases is difficult if not impossible), and they teach others their magic formulas for controlling the demon.
Yeah I feel like on average I still spend a similar amount of time developing but drastically less time fixing obscure bugs, because once it codes the feature and I describe the bugs it fixed them, the rest of my times spent testing and reviewing code.
Learning how to equip a local LLM with tools it can use to interact with to extend its capabilities has been a lot of fun for me and is a great educational experience for anyone who is interested. Just another tool for the toolchest.
Naw man, it's the first point, because in April Claude Code didn't really give you anything else that somewhat worked.
I tried to use that effectively. I even started a new greenfield project just to make sure I tested it under ideal circumstances - and while it somewhat worked, it was always super lackluster, and it was way more effective to explicitly add the context manually via a prepared .md file you just reference in the prompt.
I'd tell anyone to go for skills first before littering your project with these config files everywhere
You could make a hook in Claude to re-inject claude.md. For example, make it say "Mr Tinkleberry" in every response, and failing to do so re-injects the instructions.
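A minimal sketch of such a hook as a Stop hook script (the payload field names like transcript_path and "type": "assistant" are assumptions on my part, so check the hooks docs for the exact schema):

    #!/usr/bin/env python3
    # Hypothetical Stop hook: if the marker is missing from the last assistant
    # turn, block the stop and feed a reminder back to the model via stderr.
    import json, sys

    payload = json.load(sys.stdin)                        # hook payload from Claude Code
    last_assistant_turn = ""
    with open(payload["transcript_path"]) as transcript:  # JSONL session log (assumed)
        for line in transcript:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            if entry.get("type") == "assistant":          # assumed field name
                last_assistant_turn = line

    if "Mr Tinkleberry" not in last_assistant_turn:
        print("You stopped following CLAUDE.md. Re-read it and apply all of its "
              "instructions, including addressing the user as Mr Tinkleberry.",
              file=sys.stderr)
        sys.exit(2)   # exit code 2 blocks the stop and surfaces stderr to Claude
    sys.exit(0)

Wire it up under the Stop event in .claude/settings.json; a real version should also guard against re-triggering itself in a loop.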
I wonder if there are any benefits, side-effects or downsides of everyone using the same fake name for Claude to call them.
If a lot of people always put call me Mr. Tinkleberry in the file will it start calling people Mr. Tinkleberry even when it loses the context because so many people seem to want to be called Mr. Tinkleberry.
That's smart, but I worry that it only works partially; you'll be filling up the context window with conversation turns where the LLM consistently addresses its user as "Mr. Tinkleberry", thus reinforcing that specific behavior encoded by CLAUDE.md. I'm not convinced that this way of addressing the user implies that it keeps paying attention to the rest of the file.
I've found that Codex is much better at instruction-following like that, almost to a fault (for example, when I tell it to "always use TDD", it will try to use TDD even when just fixing already-valid tests that only need their expectations updated!)
If you have any experience in 3D modeling, I feel it's quite closer to 3D Unwrapping than software development.
You've got a bitmap atlas ("context") where you have to cram as much information as possible without losing detail, and then you need to massage both your texture and the structure of your model so that your engine doesn't go mental when trying to map your information from a 2D to a 3D space.
Likewise, both operations are rarely blemish-free and your ability resides in being able to contain the intrinsic stochastic nature of the tool.
I used to tell it to always start every message with a specific emoji. If the emoji wasn’t present, I knew the rules were ignored.
But it’s not reliable enough. It can send the emoji or address you correctly while still ignoring more important rules.
Now I find that it’s best to have a short and tight rules file that references other files where necessary. And to refresh context often. The longer the context window gets, the more likely it is to forget rules and instructions.
I tell it to accomplish only half of what it thinks it can, then conclude with a haiku. That seems to help, because 1) I feel like it starts shedding discipline as it starts feeling token pressure, and 2) I feel like it is more likely to complete task n - 1 than it is to complete task n. I have no idea if this is actually true or not, or if I'm hallucinating... all I can say is that this is the impression I get.
I guess I assumed that it's not highly relevant to the task, but I suppose it depends on interpretation. E.g. if someone tells the bus driver to smile while he drives, it's hopefully clear that actually driving the bus is more important than smiling.
Having experimented with similar config, I found that Claude would adhere to the instructions somewhat reliably at the beginning and end of the conversation, but was likely to ignore during the middle where the real work is being done. Recent versions also seem to be more context-aware, and tend to start rushing to wrap up as the context is nearing compaction. These behaviors seem to support my assumption, but I have no real proof.
> A friend of mine tells Claude to always address him as “Mr Tinkleberry”. He says he can tell Claude is not paying attention to the instructions in Claude.md when it stops calling him “Mr Tinkleberry” consistently
this is a totally normal thing that everyone does, that no one should view as a signal of a psychotic break from reality...
is your friend in the room with us right now?
I doubt I'll ever understand the lengths AI enjoyers will go to just to avoid any amount of independent thought...
I suspect you’re misjudging the friend here. This sounds more like the famous “no brown m&ms” clause in the Van Halen performance contract. As ridiculous as the request is, it being followed provides strong evidence that the rest (and more meaningful) of the requests are.
Sounds like the friend understands quite well how LLMs actually work and has found a clever way to be signaled when it’s starting to go off the rails.
It's also a common tactic for filtering inbound email.
Mention that people may optionally include some word like 'orange' in the subject line to tell you they've come via some place like your blog or whatever it may be, and have read at least carefully enough to notice this.
Of course ironically that trick's probably trivially broken now because of use of LLMs in spam. But the point stands, it's an old trick.
> I suspect you’re misjudging the friend here. This sounds more like the famous “no brown m&ms” clause in the Van Halen performance contract. As ridiculous as the request is, it being followed provides strong evidence that the rest (and more meaningful) of the requests are.
I'd argue it's more like you've bought so much into the idea that this is reasonable that you're also willing to go to great lengths to rationalize and pretend this is sane.
Imagine two different worlds: one where the tools that engineers use have a clear and reasonable way to detect and determine whether the generative subsystem is still on the rails provided by the controller.
And another world where the interface is completely devoid of any sort of basic introspection, and because it's a problematic mess all the way down, everyone invents some asinine way that they believe provides some sort of signal as to whether or not the random noise generator has gone off the rails.
> Sounds like the friend understands quite well how LLMs actually work and has found a clever way to be signaled when it’s starting to go off the rails.
My point is that while it's a cute hack, if you step back and compare it objectively to what good engineering would look like, it's wild that so many people are just willing to accept this interface as "functional" because it means they don't have to do the thinking required to emit the output the AI is able to, via the specific randomness function used.
Imagine these two worlds actually do exist; and instead of using the real interface that provides a clear boolean answer to "has the generative system gone off the rails?", they *want* to be called Mr Tinkleberry.
Which world do you think this example lives in? You could convince me Mr Tinkleberry is a cute example of the latter, obviously... but it'd take effort to convince me that this reality is half reasonable, or that it's reasonable that people who would want to call themselves engineers should feel proud to be a part of this one.
Before you try to strawman my argument, this isn't a gatekeeping argument. It's only a critical take on the interface options we have to understand something that might as well be magic, because that serves the snakeoil sales much better.
> > Is the magic token machine working?
> Fuck I have no idea dude, ask it to call you a funny name, if it forgets the funny name it's probably broken, and you need to reset it
Yes, I enjoy working with these people and living in this world.
It is kind of wild that not that long ago the general sentiment in software engineering (at least as observed on boards like this one) seemed to be about valuing systems that were understandable, introspectable, with tight feedback loops, within which we could compose layers of abstractions in meaningful and predictable ways (see for example the hugely popular - at the time - works of Chris Granger, Bret Victor, etc).
And now we've made a complete 180 and people are getting excited about proprietary black boxes and "vibe engineering" where you have to pretend like the computer is some amnesic schizophrenic being that you have to coerce into maybe doing your work for you, but you're never really sure whether it's working or not because who wants to read 8000 line code diffs every time you ask them to change something. And never mind if your feedback loops are multiple minutes long because you're waiting on some agent to execute some complex network+GPU bound workflow.
> You don’t think people are trying very hard to understand LLMs? We recognize the value of interpretability. It is just not an easy task.
I think you're arguing against a position tangential to both mine and the person this directly replies to. It can be hard to use and understand something, but if you have a magic box and you can't tell whether it's working, it doesn't belong anywhere near the systems that other humans use. The people that use the code you're about to commit to whatever repo you're generating code for all deserve better than to be part of your unethical science experiment.
> It’s not the first time in human history that our ability to create things has exceeded our capacity to understand.
I don't agree this is a correct interpretation of the current state of generative transformer-based AI. But even if you wanted to try to convince me, my point would still be that this belongs in a research lab, not anywhere near prod. And that wouldn't be a controversial idea in the industry.
> It doesn't belong anywhere near the systems that other humans use
Really for those of us who actually work in critical systems (emergency services in my case) - of course we're not going to start patching the core applications with vibe code.
But yeah, that frankenstein reporting script that half a dozen amateur hackers made a mess of over 20 years instead of refactoring and redesigning? That's prime fodder for this stuff. NOBODY wants to clean that stuff up by hand.
Your comment would be more useful if you could point us to some concrete tooling that’s been built out in the last ~3 years that LLM assisted coding has been around to improve interpretability.
This reads like you either have an idealized view of Real Engineering™, or used to work in a stable, extremely regulated area (e.g. civil engineering). I used to work in aerospace in the past, and we had a lot of silly Mr Tinkleberry canaries. We didn't strictly rely on them because our job was "extremely regulated" to put it mildly, but they did save us some time.
There's a ton of pretty stable engineering subfields that involve a lot more intuition than rigor. A lot of things in EE are like that. Anything novel as well. That's how steam in 19th century or aeronautics in the early 20th century felt. Or rocketry in 1950s, for that matter. There's no need to be upset with the fact that some people want to hack explosive stuff together before it becomes a predictable glacier of Real Engineering.
> There's no need to be upset with the fact that some people want to hack explosive stuff together before it becomes a predictable glacier of Real Engineering.
You misunderstand me. I'm not upset that people are playing with explosives. I'm upset that my industry is playing with explosives that all read, "front: face towards users"
And then, more upset that we're all seemingly ok with that.
The driving force of the enshittification of everything may be external, but the degradation clearly comes from engineers first. These broader industry trends only convince me it's not likely to get better anytime soon, and I don't like how everything is user hostile.
Man I hate this kind of HN comment that makes grand sweeping statement like “that’s how it was with steam in the 19th century or rocketry in the 1950s”, because there’s no way to tell whether you’re just pulling these things out of your… to get internet points or actually have insightful parallels to make.
Could you please elaborate with concrete examples on how aeronautics in the 20th century felt like having a fictional friend in a text file for the token predictor?
We're not going to advance the discussion this way. I also hate this kind of HN comment that makes grand sweeping statement like "LLMs are like having a fictional friend in a text file for the token predictor", because there's no way to tell whether you're just pulling these things out of your... to get internet points or actually have insightful parallels to make.
Yes, during the Wright era aeronautics was absolutely dominated by tinkering, before the aerodynamics was figured out. It wouldn't pass the high standard of Real Engineering.
> Yes, during the Wright era aeronautics was absolutely dominated by tinkering, before the aerodynamics was figured out. It wouldn't pass the high standard of Real Engineering.
Remind me: did the Wright brothers start selling tickets to individuals telling them it was completely safe? Was step 2 of their research building a large passenger plane?
I originally wanted to avoid that specific flight analogy, because it felt a bit too reductive. But while we're being reductive, how about medicine too; the first smallpox vaccine was absolutely not well understood... would that origin story pass ethical review today? What do you think the pragmatics would be if the medical profession encouraged that specific kind of behavior?
> It wouldn't pass the high standard of Real Engineering.
I disagree, I think it 100% really is engineering. Engineering at its most basic is tricking physics into doing what you want. There's no more perfect example of that than heavier-than-air flight. But there's a critical difference between engineering research and experimenting on unwitting people. I don't think users need to know how the sausage is made. That applies equally to planes, bridges, medicine, and code. But the professionals absolutely must. It's disappointing watching the industry I'm a part of willingly eschew understanding to avoid a bit of effort. Such a thing is considered malpractice in "real professions".
Ideally neither of you would wring your hands about the flavor or form of the argument, or poke fun at the gamified comment thread. But if you're going to complain that others aren't adding positively to the discussion, try to add something to it along with the complaints.
As a matter of fact, commercial passenger service started almost immediately once the tech was out of the fiction phase. The airships were large, highly experimental, barely controllable, hydrogen-filled death traps that were marketed as luxurious and safe. The first airliners also appeared with big engines and large planes (WWI disrupted this a bit). None of that was built on solid ground. The adoption was only constrained by industrial capacity and cost. Most large aircraft were more or less experimental up until the 50s, and aviation in general was unreliable until about the 80s.
I would say that right from the start everyone was pretty well aware about the unreliability of LLM-assisted coding and nobody was experimenting on unwitting people or forcing them to adopt it.
> Engineering at its most basic is tricking physics into doing what you want.
Very well, then Mr Tinkleberry also passes the bar because it's exactly such a trick. That it irks you as a cheap hack that lacks rigor (which it does) is another matter.
I use agents almost all day and I do way more thinking than I used to; this is why I’m now more productive. There is little thinking required to produce output; typing requires very little thinking. The thinking is all in the planning… If the LLM output is bad in any given file I simply change it, which is obviously much faster than typing every character.
I’m spending more time planning and my planning is more comprehensive and faster than it used to be. I’m spending less time producing output, my output is more plentiful and of equal quality. No generated code goes into my commits without me reviewing it for quality. Where is the problem here?
It feels like you’re blaming the AI engineers here, that they built it this way out of ignorance or something. Look into interpretability research. It is a hard problem!
I am blaming the developers who use AI because they're willing to sacrifice intellectual control in trade for something that I find has minimal value.
I agree it's likely a complex or intractable problem. But I don't enjoy watching my industry slide down the professionalism scale. Professionals don't choose tools whose workings they can't explain. If your solution to knowing whether your tool is still functional is inventing an amusing name and using that as the heuristic, because you have no better way to determine if it's still working correctly, that feels like it might be a problem, no?
The 'canary in the coal mine' approach (like the Mr. Tinkleberry trick) is silly but pragmatic. Until we have deterministic introspection for LLMs, engineers will always invent weird heuristics to detect drift. It's not elegant engineering, but it's effective survival tactics in a non-deterministic loop.
> We recommend keeping task-specific instructions in separate markdown files with self-descriptive names somewhere in your project. Then, in your CLAUDE.md file, you can include a list of these files with a brief description of each, and instruct Claude to decide which (if any) are relevant and to read them before it starts working.
I've been doing this since the early days of agentic coding though I've always personally referred to it as the Table-of-Contents approach to keep the context window relatively streamlined. Here's a snippet of my CLAUDE.md file that demonstrates this approach:
# Documentation References
- When adding CSS, refer to: docs/ADDING_CSS.md
- When adding assets, refer to: docs/ADDING_ASSETS.md
- When working with user data, refer to: docs/STORAGE_MANAGER.md
I think the key here is the “if X then Y” syntax - this seems to be quite effective at piercing through the “probably ignore this” system message by highlighting WHEN a given instruction is “highly relevant”.
Indeed, the article links to the skill documentation which says:
Skills are modular capabilities that extend Claude’s functionality through organized folders containing instructions, scripts, and resources.
And
Extend Claude’s capabilities for your specific workflows
E.g. building your project is definitely a workflow.
It also makes sense to put as much as you can into a skill, as this is an optimized mechanism for Claude Code to retrieve relevant information based on the skill’s frontmatter.
Yeah I think "Skills" are just a more codified folder based approach to this TOC system. The main reason I haven't migrated yet is that the TOC approach lends itself better to the more generic AGENTS.md style - allowing me to swap over to alternative LLMs (such as Gemini) relatively easily.
I don't get the point. Point it at your relevant files, ask it to review, discuss the update, refine its understanding, and then tell it to go.
I have found that more context, comments, and info damage quality on hard problems.
I actually for a long time now have two views for my code.
1. The raw code with no empty space or comments.
2. Code with comments
I never give the second to my LLM. The more context you give, the lower its upper end of quality becomes. This is just a habit I've picked up using LLMs every day, hours a day, since GPT-3.5; it allows me to reach farther into extreme complexity.
I suppose I don't know what most people are using LLMs for, but the higher the complexity your work entails, the less noise you should inject into it. It's tempting to add massive amounts of context, but I've routinely found that fails at the higher levels of coding complexity and uniqueness. It was more apparent in earlier models; newer ones will handle tons of context, you just won't be able to get those upper ends of quality.
The compute-to-information ratio is all that matters. Compute is capped.
> I have found that more context, comments, and info damage quality on hard problems.
There can be diminishing returns, but every time I’ve used Claude Code for a real project I’ve found myself repeating certain things over and over again and interrupting tool usage until I put it in the Claude notes file.
You shouldn’t try to put everything in there all the time, but putting key info in there has been very high ROI for me.
Disclaimer: I’m a casual user, not a hardcore vibe coder. Claude seems much more capable when you follow the happy path of common projects, but gets constantly turned around when you try to use new frameworks and tools and such.
Setting hooks has been super helpful for me, you can reject certain uses of tools (don’t touch my tests for this session) with just simple scripting code.
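For example, a PreToolUse hook that refuses edits to test files could look roughly like this (a sketch; I'm assuming the payload carries tool_name and tool_input.file_path, so adjust to whatever the hooks docs actually specify):

    #!/usr/bin/env python3
    # Hypothetical PreToolUse hook: reject edits/writes to test files this session.
    import json, sys

    payload = json.load(sys.stdin)
    tool = payload.get("tool_name", "")
    target = (payload.get("tool_input") or {}).get("file_path", "")

    if tool in ("Edit", "Write", "MultiEdit") and "/tests/" in target:
        print("Tests are off-limits for this session; do not modify them.",
              file=sys.stderr)
        sys.exit(2)   # non-zero "block" exit: the tool call is rejected
                      # and the message is shown to the model
    sys.exit(0)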
A git lint hook has been key. No matter how many times I told it, it lints randomly. Sometimes not at all. Sometimes before running tests (but not after fixing test failures).
Agreed, I don't love the CLAUDE.md that gets autogenerated. It's too wordy for me to understand and for the model to follow consistently.
I like to write my CLAUDE.md directly, with just a couple paragraphs describing the codebase at a high level, and then I add details as I see the model making mistakes.
> 1. The raw code with no empty space or comments. 2. Code with comments
I like the sound of this but what technique do you use to maintain consistency across both views? Do you have a post-modification script which will strip comments and extraneous empty space after code has been modified?
As I think more on how this could work, I’d treat the fully commented code as the source of truth (SOT).
1. Run the SOT through a processor to strip comments and extra spaces (a sketch of such a processor follows below). Publish to a feature branch.
2. Point Claude at feature branch. Prompt for whatever changes you need. This runs against the minimalist feature branch. These changes will be committed with comments and readable spacing for the new code.
3. Verify code changes meet expectations.
4. Diff the changes from minimal version, and merge only that code into SOT.
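Step 1's processor can be dead simple. A naive sketch for Python sources (line-based on purpose, so it won't handle '#' inside strings or inline comments; a real version could use the tokenize module):

    #!/usr/bin/env python3
    # Strip blank lines and full-line comments to produce the "minimalist" view.
    import pathlib, sys

    for name in sys.argv[1:]:
        path = pathlib.Path(name)
        lines = path.read_text().splitlines()
        kept = [l for l in lines if l.strip() and not l.strip().startswith("#")]
        path.write_text("\n".join(kept) + "\n")

Run it over the files on the feature branch after copying them from the SOT.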
1. Run into a problem you and AI can't solve.
2. Drop all comments
3. Restart debug/design session
4. Solve it and save results
5. Revert code to have comments and put update in
If that still doesn't work:
Step 2.5 drop all unrelated code from context
IMO within the documentation .md files the information density should be very high. Higher than trying to shove the entire codebase into context that is for sure.
You definitely don't just push the entire codebase. Previous models required you to be meticulous about your input. A function here, a class there.
Even now if I am working on REALLY hard problems I will still manually copy and paste code sections out for discussion and algorithm designs. Depends on complexity.
This is why I still believe OpenAI's o1-pro was the best model I've ever seen. The amount of compute you could throw at a problem was absurd.
Genuinely curious — how did you isolate the effect of comments/context on model performance from all the other variables that change between sessions (prompt phrasing, model variance, etc)? In other words, how did you validate the hypothesis that "turning off the comments" (assuming you mean stripping them temporarily...) resulted in an objectively superior experience?
What did your comparison process look like? It feels intuitively accurate and validates my anecdotal impression but I'd love to hear the rigor behind your conclusions!
I was already in the habit of copy-pasting relevant code sections to maximize reasoning performance, to squeeze performance out of earlier, weaker models on stubborn problems. (I still do this on really nasty ones.)
It's also easy to notice LLMs create garbage comments that get worse over time. I started deleting all comments manually alongside manual snippet selection to get max performance.
Then I started just routinely deleting all comments before a big problem-solving session. I was doing it enough that I built some automation.
Maybe high quality human comments improve ability? Hard to test in a hybrid code base.
Writing and updating CLAUDE.md or AGENTS.md feels pointless to me. Humans are the real audience for documentation. The code changes too fast, and LLMs are stateless anyway.
What’s been working is just letting the LLM explore the relevant part of the code to acquire the context, defining the problem or feature, and asking for a couple of ways to tackle it. All in one short prompt.
That usually gets me solid options to pick and build it out.
And I always do one session for one problem.
This is my lazy approach to getting useful help from an LLM.
I use .md to tell the model about my development workflow. Along the lines of "here's how you lint", "do this to re-generate the API", "this is how you run unit tests", "The sister repositories are cloned here and this is what they are for".
One may argue that these should go in a README.md, but these markdowns are meant to be more streamlined for context, and it's not appropriate to put a one-liner in the imperative tone to fix model behavior in a top-level file like the README.md
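Concretely, mine reads something like this (the commands and paths are made up for the example):

    # Workflow
    - Lint: run `make lint` before finishing any task
    - Unit tests: `make test`, scoped to the package you touched
    - Re-generating the API: edit api/schema.yaml, then run `make codegen`
    - Sister repos: ../foo-client (TypeScript SDK), ../foo-infra (deploy configs); read-only from here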
I’m definitely interested in techniques for reducing token usage. But with one session per problem I’ve never hit a context limit yet, especially when the problem is small and clearly defined using divide-and-conquer. Also, agentic models are improving at tool use and should require fewer tokens. I’ll take as many iterations as needed to ensure the code is correct.
A well-documented codebase lets both developers and agentic models locate relevant code easily. If you treat the model like a teammate, extra docs for LLMs are unnecessary. IMHO. In frontend work, code moves quickly.
There is a far easier way to do this, and one that is perfectly aligned with how these tools work.
It is called documenting your code!
Just write what this file is supposed to do in a clear concise way. It acts as a prompt, it provides much needed context specific to the file and it is used only when necessary.
Another tip is to add README.md files where possible and where it helps. What is this folder for? Nobody knows! Write a README.md file. It is not rocket science.
What people often forget about LLMs is that they are largely trained on public information which means that nothing new needs to be invented.
You don't have to "prompt it just the right way".
What you have to do is to use the same old good best practices.
For the record I do think the AI community tries to unnecessarily reinvent the wheel on crap all the time.
sure, readme.md is a great place to put content. But there are things I'd put in a readme that I'd never put in a claude.md if we want to squeeze the most out of these models.
Further, claude/agents.md have special quality-of-life mechanics with the coding agent harnesses like e.g. `injecting this file into the context window whenever an agent touches this directory, no matter whether the model wants to read it or not`
> What people often forget about LLMs is that they are largely trained on public information which means that nothing new needs to be invented.
I don't think this is relevant at all - when you're working with coding agents, the more you can finesse and manage every token that goes into your model and how its presented, the better results you can get. And the public data that goes into the models is near useless if you're working in a complex codebase, compared to the results you can get if you invest time into how context is collected and presented to your agent.
> For the record I do think the AI community tries to unnecessarily reinvent the wheel on crap all the time.
On Reddit's LLM subreddits people are rediscovering the very basics of software project management as massive insights daily, or at the very least weekly.
Who would've guessed that proper planning, accessible and up-to-date documentation, and splitting tasks into manageable, testable chunks produces good code? Amazing!
Then they write a massive blog post or even some MCP monstrosity for it and post it everywhere as a new discovery =)
So how exactly does one "write what this file is supposed to do in a clear concise way" in a way that is quickly comprehensible to AI? The gist of the article is that when your audience changes from "human" to "AI" the manner in which you write documentation changes. The article is fairly high quality, and presents excellent evidence that simply "documenting your code" won't get you as far as the guidelines it provides.
Your comment comes off as if you're dispensing common-sense advice, but I don't think it actually applies here.
Writing documentation for LLMs is strangely pleasing because you have very linear returns for every bit of effort you spend on improving its quality and the feedback loop is very tight. When writing for humans, especially internal documentation, I’ve found that these returns are quickly diminishing or even negative as it’s difficult to know if people even read it or if they didn’t understand it or if it was incomplete.
Well, no. You run into the context limit pretty fast (or the attention limit for long-context models), and the model understands pretty well what code does without documentation.
There's also the question of processes: how to format code, what style of catching to use, and how to run the tests, which humans keep in the back of their heads after reading it once or twice, but which need to be a constant reminder for an LLM whose knowledge lifespan is limited to the session.
I’m pretty sure Claude would not work well in my code base if I hadn’t meticulously added docstrings, type hints, and module level documentation. Even if you’re stubbing out code for later implementation, it helps to go ahead and document it so that a code assistant will get a hint of what to do next.
This is missing the point. If I want to instruct Claude to never write a database query that doesn't hit a preexisting index, where exactly am I supposed to document that? You can either choose:
1. A centralized location, like a README (congrats, you've just invented CLAUDE.md)
2. You add a docs folder (congrats, you've just done exactly what the author suggests under Progressive Disclosure)
Moreover, you can't just do it all in a README, for the exact reasons that the author lays out under "CLAUDE.md file length & applicability".
CLAUDE.md simply isn't about telling Claude what all the parts of your code are and how they work. You're right, that's what documenting your code is for. But even if you have READMEs everywhere, Claude has no idea where to put code when it starts a new task. If it has to read all your documentation every time it starts a new task, you're needlessly burning tokens. The whole point is to give Claude important information up front so it doesn't have to read all your docs and fill up its context window searching for the right information on every task.
Think of it this way: incredibly well documented code has everything a new engineer needs to get started on a task, yes. But this engineer has amnesia and forgets everything it's learned after every task. Do you want them to have to reonboard from scratch every time? No! You structure your docs in a way so they don't have to start from scratch every time. This is an accommodation: humans don't need this, for the most part, because we don't reonboard to the same codebase over and over. And so yes, you do need to go above and beyond the "same old good best practices".
1. Create a tool that can check whether a query hits a preexisting index (a sketch follows below).
2. Either force Claude to use it (hooks) or suggest it (CLAUDE.md).
3. Profit!
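A minimal sketch of step 1, assuming Postgres and psycopg2 (and accepting that a Seq Scan on a tiny table isn't always a bug):

    #!/usr/bin/env python3
    # Hypothetical index checker: EXPLAIN the query read from stdin and fail
    # if the plan falls back to a sequential scan instead of hitting an index.
    import sys
    import psycopg2

    query = sys.stdin.read()
    conn = psycopg2.connect("")          # connection details come from PG* env vars
    with conn, conn.cursor() as cur:
        cur.execute("EXPLAIN " + query)  # plain EXPLAIN only plans, it never executes
        plan = "\n".join(row[0] for row in cur.fetchall())

    if "Seq Scan" in plan:
        print("Query does not hit an existing index:\n" + plan, file=sys.stderr)
        sys.exit(1)
    print("OK: plan avoids sequential scans.")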
As for "where stuff is", for anything more complex I have a tree-style graph in CLAUDE.md that shows the rough categories of where stuff is. Like the handler for letterboxd is in cmd/handlerletterboxd/ and internal modules are in internal/
Now it doesn't need to go in blind but can narrow down searches when I tell it to "add director and writer to the letterboxd handler output".
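The CLAUDE.md section is literally just a short annotated tree, something like (docs/ is hypothetical here):

    # Where things live
    cmd/handlerletterboxd/   # letterboxd handler entrypoint
    internal/                # shared internal modules
    docs/                    # longer-form notes, read on demand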
think about how this thing is interacting with your codebase. it can read one file at a time. sections of files.
in this UX, is it ergonomic to go hunting for patterns and conventions? if you have to linearly process every single thing you look at every time you do something, how are you supposed to have “peripheral vision”? if you have amnesia, how do you continue to do good work in a codebase given you’re a skilled engineer?
it is different from you. that is OK. it doesn’t mean it’s stupid. it means it needs different accommodations to perform as well as you do. accommodations IRL exist for a reason; different people work differently and have different strengths and weaknesses. just like humans, you get the most out of them if you meet and work with them from where they’re at.
You put a warning where it is most likely to be seen by a human coder.
Besides, no amount of prompting will prevent this situation.
If it is a concern then you put a linter or unit tests to prevent it altogether, or make a wrapper around the tricky function with some warning in its doc strings.
I don't see how this is any different from how you typically approach making your code more resilient to accidental mistakes.
But they are right, claude routinely ignores stuff from CLAUDE.md, even with warning bells etc. You need a linter preventing things. Like drizzle sql` templates: it just loves them.
You can make affordances for agent abilities without deviating from what humans find to be good documentation. Use hyperlinks, organize information, document in layers, use examples, be concise. It's not either/or unless you're being lazy.
> no amount of prompting will prevent this situation.
Again, missing the point. If you don't prompt for it and you document it in a place where the tool won't look first, the tool simply won't do it. "No amount of prompting" couldn't be more wrong; it works for me and all my coworkers.
> If it is a concern then you put a linter or unit tests to prevent it altogether
Sure, and then it'll always do things its own way, run the tests, and have to correct itself. Needlessly burning tokens. But if you want to pay for it to waste its time and yours, go for it.
> I don't see how this is any different from how you typically approach making your code more resilient to accidental mistakes.
It's not about avoiding mistakes! It's about having it follow the norms of your codebase.
- My codebase at work is slowly transitioning from Mocha to Jest. I can't write a linter to ban new mocha tests, and it would be a pain to keep a list of legacy mocha test suites. The solution is to simply have a bullet point in the CLAUDE.md file that says "don't write new Mocha test suites, only write new test suites in Jest". A more robust solution isn't necessary and doesn't avoid mistakes, it avoids the extra step of telling the LLM to rewrite the tests.
- We have a bunch of terraform modules for convenience when defining new S3 buckets. No amount of documenting the modules will have Claude magically know they exist. You tell it that there are convenience modules and to consider using them.
- Our ORM has findOne that returns one record or null. We have a convenience function getOne that returns a record or throws a NotFoundError to return a 404 error. There's no way to exhaustively detect with a linter that you used findOne and checked the result for null and threw a NotFoundError. And the hassle of maybe catching some instances isn't necessary, because avoiding it is just one line in CLAUDE.md.
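In CLAUDE.md each of these ends up as a plain one-liner, roughly (paths illustrative):

    - Write new test suites in Jest; never add new Mocha suites.
    - New S3 buckets should use the convenience modules under terraform/modules/.
    - Prefer getOne (throws NotFoundError) over findOne plus a manual null check when a 404 is the desired behavior.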
Learned this the hard way. Asked Claude Code to run a database migration. It deleted my production database instead, then immediately apologised and started panicking trying to restore it.
Thankfully Azure keeps deleted SQL databases recoverable, so I got it back in under an hour. But yeah - no amount of CLAUDE.md instructions would have prevented that. It no longer gets prod credentials.
Probably a lot of people here disagree with this feeling. But my take is that if setting up all the AI infrastructure and onboarding to my code is going to take this amount of effort, then I might as well code the damn thing myself which is what I'm getting paid to (and enjoy doing anyway)
Minutes, really. Despite what the article says, you can get 90% of the way there by telling Claude how you want the project documentation structured and just letting it do it. It's up to you if you really want to tune the last 10% manually; I don't. I have been using basically the same system, and when I tell Claude to update docs it doesn't revert to one big Claude.md, it maintains it in a structure like this.
If you have a counter-study (for experienced devs, not juniors), I'd be curious to see. My experience also has been that using AI as part of your main way to produce code, is not faster when you factor in everything.
Curious why there hasn't been a rebuttal study to that one yet (or if there is I haven't seen it come up). There must be near infinite funding available to debunk that study right?
I've heard this mentioned a few times. Here is a summarized version of the abstract:
> ... We conduct a randomized controlled trial (RCT) ... AI tools ... affect the productivity of experienced open-source developers. 16 developers with moderate AI experience complete 246 tasks in mature projects on which they have an average of 5 years of prior experience. Each task is randomly assigned to allow or disallow usage of early-2025 AI tools. ... developers primarily use Cursor Pro ... and Claude 3.5/3.7 Sonnet. Before starting tasks, developers forecast that allowing AI will reduce completion time by 24%. After completing the study, developers estimate that allowing AI reduced completion time by 20%.
> Surprisingly, we find that allowing AI actually increases completion time by 19%—AI tooling slowed developers down. This slowdown also contradicts predictions from experts in economics (39% shorter) and ML (38% shorter). To understand this result, we collect and evaluate evidence for 21 properties of our setting that a priori could contribute to the observed slowdown effect—for example, the size and quality standards of projects, or prior developer experience with AI tooling. Although the influence of experimental artifacts cannot be entirely ruled out, the robustness of the slowdown effect across our analyses suggests it is unlikely to primarily be a function of our experimental design.
So what we can gather:
1. 16 people were randomly given tasks to do
2. They knew the codebase they worked on pretty well
3. They said AI would help them work 24% faster (before starting tasks)
4. They said AI made them ~20% faster (after completion of tasks)
5. ML Experts claim that they think programmers will be ~38% faster
6. Economists say ~39% faster.
7. We measured that people were actually 19% slower
This seems to be done on Cursor, with big models, on codebases people know. There are definitely problems with industry-wide statements like this but I feel like the biggest area AI tools help me is if I'm working on something I know nothing about. For example: I am really bad at web development so CSS / HTML is easier to edit through prompts. I don't have trouble believing that I would be slower trying to make an edit to code that I already know how to make.
Maybe they would see the speedups by allowing the engineer to select when to use the AI assistance and when not to.
It's a couple of hours right now, then another couple of hours "correcting" the AI when it still goes wrong, another couple of hours tweaking the file again, another couple of hours to update when the model changes, another couple of hours when someone writes a new blog post with another method etc.
There's a huge difference between investing time into a deterministic tool like a text editor or programming language and a moving target like "AI".
The difference between programming in Notepad in a language you don't know and using "AI" will be huge. But the difference between being fluent in a language and having a powerful editor/IDE? Minimal at best. I actually think productivity is worse because it tricks you into wasting time via the "just one more roll" (ie. gambling) mentality. Not to mention you're not building that fluency or toolkit for yourself, making you barely more valuable than the "AI" itself.
You say that as if tech hasn't always been a moving target anyway. The skills I spent months learning for a specific language and IDE became obsolete with the next job and the next paradigm shift. That's been one of the few consistent themes throughout my career. Hours here and there, spread across months and years, just learning whatever was new. Sometimes, like with Linux, it really paid off. Other times, like PHP, it paid off and then fizzled out.
--
The other thing is, this need for determinism bewilders me. I mean, I get where it comes from; we want nice, predictable, reliable machines. But how deterministic does it need to be? If today it decides to generate code and the variable is called fileName, and tomorrow it's filePath, as long as it's passing tests, what do I care that it's not totally deterministic and the names of the variables it generates are different? As long as it's consistent with existing code, and it passes tests, what's the importance of it being deterministic to a computer-science level of rigor? It reminds me of the travelling salesman problem, or the knapsack problem. Both NP-hard, but users don't care about that. They just want the computer to tell them something good enough for them to go on about their day. So if a customer comes up to me and offers a pile of money to solve either one of those problems, do I laugh in their face, knowing damn well I won't be the one to prove that NP = P, or do I explain the situation and build them software that will do the best it can with however much compute they're willing to pay for?
Whether it's setting up AI infrastructure or configuring Emacs/vim/VSCode, the important distinction to make is if the cost has to be paid continually, or if it's a one time/intermittent cost. If I had to configure my shell/git aliases every time I booted my computer, I wouldn't use them, but seeing as how they're saved in config files, they're pretty heavily customized by this point.
Don't use AI if you don't want to, but "it takes too much effort to set up" is an excuse printf debuggers use to avoid setting up a debugger. Which is a whole other debate though.
I fully agree with this POV except for one detail: there is a problem with sunsetting frontier models. As we begin to adopt these tools and build workflows with them, they become pieces of our toolkit. We depend on them. We take them for granted, even. And then the model changes (new checkpoints, maybe alignment gets fiddled with) and all of a sudden prompts no longer yield the same results we expected from them after working on them for quite some time. I think the term for this is "prompt instability".
I felt this with Gemini 3 (and some people had a less pronounced but similar experience with Sonnet releases after 3.7), which for certain tasks that 2.5 Pro excelled at... it's just unusable now. I was already a local model advocate before this, but now I'm a local model zealot. I've stopped using Gemini 3 over this. Last night I used Qwen3 VL on my 4090, and although it was not perfect (sycophancy, overuse of certain cliches... nothing I can't get rid of later with some custom promptsets and a few hours in Heretic), it did a decent enough job of helping me work through my blindspots in the UI/UX for a project that I got what I needed.
If we have to perform tuning on our prompts ("skills", agents.md/claude.md, all of the stuff a coding assistant packs context with) every model release then I see new model releases becoming a liability more than a boon.
If you find it works for you, then that’s great! This post is mostly from our learnings from getting it to solve hard problems in complex brownfield codebases where auto generation is almost never sufficient.
I’m sure I’m just working like a caveman, but I simply highlight the relevant code, add it to the chat, and talk to these tools as if they were my colleagues and I’m getting pretty good results.
Even 6 to 12 months ago this was not the case (with or without .md files); I was getting mainly subpar results, so I’m assuming that the models have improved a lot.
Basically, I found that they don’t make that much of a difference; the model is either good enough or not…
I know (or at least I suppose) that these markdown files could bring some marginal improvements, but at this point, I don’t really care.
I assume this is an unpopular take because I see so many people treat these files as if they were black magic or a silver bullet that 100x's their already 1000x productivity.
Yep it is opinionated for how to get coding agents to solve hard problems in complex brownfield codebases which is what we are focused on at humanlayer :)
Matches my experience also. I bothered only once to set up a proper CLAUDE.md file, and now never do it. Simply referring to the context properly for surgical recommendations and edits works relatively well.
It feels a lot like bikeshedding to me, maybe I’m wrong
I gave it a tool to execute to get that info if required, but it mostly doesn’t need to due to Kysely migration files and the database type definition being enough.
Even without the explicit magic ORMs, with data mapper style query builders like Kysely and similar, I still find I need to marshall selected rows into objects to, yknow, do things with them in a lot of cases.
Sure, but that's not the same thing. For example, whether or not you have to redeclare your entire database schema in a custom ORM language in a different repo.
I find the Claude.md file mostly useless. It seems to be 50/50 or LESS that Claude even reads/uses this file.
You can easily test this by adding some mandatory instruction into the file. E.g. "Any new method you write must have fewer than 50 lines of code." Then use Claude for ten minutes and watch it blow through this limit again and again.
I use CC and Codex extensively and I constantly am resetting my context and manually pasting my custom instructions in again and again, because these models DO NOT remember or pay attention to Claude.md or Agents.md etc.
I'm not sure if Claude Code has integrated it in its system prompts or not since it's moving at breakneck speed, but one instruction I like putting on all of my projects is to "Prompt for technical decisions from user when choices are unsure". This would almost always trigger the prompting feature that Claude Code has for me when it's got some uncertainty about the instructions I gave it, giving me options or alternatives on how to approach the problem when planning or executing.
This way, it's got more of a chance of generating something that I wanted, rather than running off on its own.
I have found enabling the codebase itself to be the “Claude.md” to be most effective. In other words, set up effective automated checks for linting, type checking, unit tests etc and tell Claude to always run these before completing a task. If the agent keeps doing something you don’t like, then a linting update or an additional test often is more effective than trying to tinker with the Claude.md file. Also, ensure docs on the codebase are up to date and tell Claude to read relevant parts when working on a task and of course update the docs for each new task. YMMV but this has worked for me.
> Also, ensure docs on the codebase are up to date and tell Claude to read relevant parts when working on a task
Yeah, if you do this every time it works fine. If you add what you tell it every time to CLAUDE.md, it also works fine, but you don’t have to tell it any more ;)
That's a good write-up. Very useful to know. I'm sort of on the outside of all this; I've only sort of dabbled, and now use Copilot quite a lot with Claude. What's being said here reminds me a lot of CPU registers. If you think about the limited space in CPU registers, it's astounding how much processing of information we're actually able to do. So we actually need higher layers of systems and operating systems to help manage all of this. So it feels like a lot of what's being said here will inevitably end up being an automated system or compiler or effectively an operating system. Even something basic like a paging system would make a lot of difference.
I've gotten quite a bit of utility out of my current setup[0]:
Some explicit things I found helpful: Have the agent address you as something specific! This way you know if the agent is paying attention to your detailed instructions.
Rationality, as in the stuff practiced on early Less Wrong, gives a great language for constraining the agent, and since it's read The Sequences and everything else, you can include pointers; the more you do, the more it will nudge it into that mode of thought.
The explicit "This is what I'm doing, this is what I expect" pattern has been hugely useful for both me monitoring it/coming back to see what it did, and it itself. It makes it more likely to recover when it goes down a bad path.
The system reminder this article mentions is definitely there but I have not noticed it messing much with adherence. I wish there were some sort of power user mode to turn it off though!
Also, this is probably too long! But I have been experimenting and iterating for a while, and this is what is working best currently. Not that I've been able to hold any other part constant -- Opus 4.5 really is remarkable.
I've recently started using a similar approach for my own projects. Providing a high-level architecture overview in a single markdown file really helps the LLM understand the 'why' behind the code, not just the 'how'.
Does anyone have a specific structure or template for Claude.md that works best for frontend-heavy projects (like React/Vite)? I find that's where the context window often gets cluttered.
I have Claude itself write CLAUDE.md. Once it is informed of its context (e.g., "README.md is for users, CLAUDE.md is for you") you can say things like, "update readme and claudemd" and it will do it. I find this especially useful for prompts like, "update claudemd to make absolutely certain that you check the API docs every single time before making assumptions about its behavior" — I don't need to know what magick spell will make that happen, just that it does happen.
Do you have any proof that AI written instructions are better than human ones? I don't see why an AI would have an innate understanding on how best to prompt itself.
Having been through cycles of writing it manually with '#' and having it do it itself, it seems to have been a push on efficacy, while I spend less effort and get less frustrated. Hard to quantify except to say that I've had great results with it. I appreciate the spirit of OP's "CLAUDE.md is the highest leverage point of the harness, so avoid auto-generating it", but you can always ask Claude to tighten it up itself too.
Generally speaking it has a lot of information from things like OP's blog post on how best to structure the file and prompt itself and you can also (from within Claude Code) ask it to look at posts or Anthropic prompting best practices and adopt those to your own file.
I find writing a good CLAUDE.md is done by running /init, and having the LLM write it. If you need more controls on how it should work, I would highly recommend you implement it in an unavoidable way via hooks and not in a handwritten note to your LLM.
That paper the article references is old at this point. No GPT 5.1, no Gemini 3, which both were game changers. I'd love to see their instruction following graphs.
Yes README.md should still be written for humans and isn’t going away anytime soon.
CLAUDE.md is a convention used by claude code, and AGENTS.md is used by other coding agents. Both are intended to be supplemental to the README and are deterministically injected into the agent’s context.
It’s a configuration point for the harness, it’s not intended to replace the README.
Some of the advice in here will undoubtedly age poorly as harnesses change and models improve, but some of the generic principles will stay the same - e.g. that you shouldn’t use an LLM to do a linter &formatter’s job, or that LLMs are stateless and need to be onboarded into the codebase, and having some deterministically-injected instructions to achieve that is useful instead of relying on the agent to non-deterministically derive all that info by reading config and package files
The post isn’t really intended to be super forward-looking as much as “here’s how to use this coding agent harness configuration point as best as we know how to right now”
> you shouldn’t use an LLM to do a linter &formatter’s job,
Why is that good advice? If that thing is eventually supposed to do the most tricky coding tasks, and already a year ago could have won a medal at the informatics olympics, then why wouldn't it eventually be able to tell if I'm using 2 or 4 spaces and format my code accordingly? Either it's going to change the world, then this is a trivial task, or it's all vaporware, then what are we even discussing..
> or that LLMs are stateless and need to be onboarded into the codebase
What? Why would that be a reasonable assumption/prediction for even near term agent capabilities? Providing it with some kind of local memory to dump its learned-so-far state of the world shouldn't be too hard. Isn't it supposed to already be treated like a junior dev? All junior devs I'm working with remember what I told them 2 weeks ago. Surely a coding agent can eventually support that too.
This whole CLAUDE.md thing seems a temporary kludge until such basic features are sorted out, and I'm seriously surprised how much time folks are spending to make that early broken state less painful to work with. All that precious knowledge y'all are building will be worthless a year or two from now.
> Why is that good advice? If that thing is eventually supposed to do the most tricky coding tasks, and already a year ago could have won a medal at the informatics olympics, then why wouldn't it eventually be able to tell if I'm using 2 or 4 spaces and format my code accordingly? Either it's going to change the world, then this is a trivial task, or it's all vaporware, then what are we even discussing..
This is the exact reason for the advice: The LLM already is able to follow coding conventions by just looking at the surrounding code which was already included in the context. So by adding your coding conventions to the claude.md, you are just using more context for no gain.
And another reason to not use an agent for linting/formatting (i.e. prompting it to "format this code for me") is that dedicated linters/formatters are faster and only take maybe a single cent of electricity to run, whereas using an LLM to do that job will cost multiple dollars if not more.
> Then why wouldn't it eventually be able to tell if I'm using 2 or 4 spaces and format my code accordingly?
It's not that an agent doesn't know if you're using 2 or 4 spaces in your code; it comes down to:
- there are many ways to ensure your code is formatted correctly; that's what .editorconfig [1] is for.
- in a halfway serious project, incorrectly formatted code shouldn't reach the LLM in the first place
- tokens are relatively cheap but they're not free on a paid plan; why spend tokens on something linters and formatters can do deterministically and for free?
If you wanted Claude Code to handle linting automatically, you're better off taking that out of CLAUDE.md and creating a Skill [2].
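For anyone who hasn't used it, a minimal .editorconfig along those lines might look like this (values are illustrative; pick whatever your project actually uses):

```ini
# .editorconfig: let the editor/formatter own whitespace so the agent never has to
root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
indent_style = space
indent_size = 4

[*.{js,ts,json,yml}]
indent_size = 2
```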
> What? Why would that be a reasonable assumption/prediction for even near-term agent capabilities? Providing it with some kind of local memory to dump its learned-so-far state of the world shouldn't be too hard. Isn't it supposed to already be treated like a junior dev? All junior devs I'm working with remember what I told them 2 weeks ago. Surely a coding agent can eventually support that too.
It wasn't mentioned in the article, but Claude Code, for example, does save each chat session by default. You can come back to a project and type `claude --resume` and you'll get a list of past Claude Code sessions that you can pick up from where you left off.
The stateless nature of Claude Code is what annoys me so much. Like it has to spend so much time doing repetitious bootstraps. And how much it “picks up and propagates” random shit it finds in some document it wrote. It will echo back something it wrote that “stood out” and I’ll forget where it got that and ask “find where you found that info so we can remove it.” And it will do so but somehow mysteriously pick it up again, and it will be because of some git commit message or something. It’s like a tune stuck in its head, only it’s sticky for LLMs, not humans.
And that describes the issues I had with the “automatic memories” features in things like ChatGPT. Turns out it is an awful judge of things to remember. Like it would make memories like “cruffle is trying to make pepper soup with chicken stock”! Which it would then parrot back to me at some point 4 months later and I’d be like “WTF I figured it out”. The “# remember this” is much more powerful because I know how sticky this stuff gets, and I’d rather have it over-index on my own forceful memories than random shit it decided.
I dunno. All I’m saying is you are right. The future is in having these things do a better job of remembering. And I don’t know if LLMs are the right tool for that. Keyword search isn’t either though. And vector search might not be either—I think it suffers from the same kinds of “catchy tune attack” an LLM might.
I think this is an overall good approach and I've gotten alright results with something similar - I still think that this CLAUDE.md experience is too magical and that Anthropic should really focus on it.
Actually having official guidelines in their docs would be a good entrypoint, even though I guess we have this which is the closest available from anything official for now: https://www.claude.com/blog/using-claude-md-files
One interesting thing I also noticed and used recently is that Claude Code ships with a @agent-claude-code-guide. I've used it to review and update my dev workflow / CLAUDE.md file but I've got mixed feelings on the discussion with the subagent.
The advice here seems to assume a single .md file with instructions for the whole project, but the AGENTS.md methodology as supported by agents like github copilot is to break out more specific AGENTS.md files in the subdirectories in your code base. I wonder how and if the tips shared change assuming a flow with a bunch of focused AGENTS.md files throughout the code.
<system-reminder>
IMPORTANT: this context may or may not be relevant to your tasks.
You should not respond to this context unless it is highly relevant to your task.
</system-reminder>
Perhaps a small proxy between Claude code and the API to enforce following CLAUDE.md may improve things… I may try this
Interesting selection of models for the "instruction count vs. accuracy" plot. Curious when that was done and why they chose those models. How well does ChatGPT 5/5.1 (and codex/mini/nano variants), Gemini 3, Claude Haiku/Sonnet/Opus 4.5, recent grok models, Kimi 2 Thinking etc (this generation of models) do?
Sure - I was more commenting that they are all > 6 months old, which sounds silly, but things have been changing fast, and instruction following is definitely an area that has been developing a lot recently. I would be surprised if accuracy drops off that hard still.
I imagine it’s highly correlated to parameter count, but the research is a few months old and frontier model architecture is pretty opaque, so it's hard to draw too many conclusions about newer models that aren’t in the study besides what I wrote in the post.
"You can investigate this yourself by putting a logging proxy between the claude code CLI and the Anthropic API using ANTHROPIC_BASE_URL" I'd be eager to read a tutorial about that I never know which tool to favour for doing that when you're not a system or network expert.
agree - i've had claude one-shot this for me at least 10 times at this point cause i'm too lazy to lug whatever code around. literally made a new one this morning
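For anyone who wants a starting point rather than a tutorial, here is a rough stdlib-only sketch of such a logging proxy (port and log file name are arbitrary; it buffers streaming responses instead of relaying them incrementally, which is fine for inspection):

```python
#!/usr/bin/env python3
"""Rough sketch of a logging proxy for Claude Code traffic.

Run it, then start Claude Code with:
    ANTHROPIC_BASE_URL=http://localhost:8082 claude
Request bodies (where CLAUDE.md and the rest of the context show up)
are appended to requests.log as JSON lines.
"""
import json
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "https://api.anthropic.com"
LOG_FILE = "requests.log"

class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)

        # Log the outgoing request body before forwarding it.
        with open(LOG_FILE, "a") as f:
            f.write(json.dumps({"path": self.path,
                                "request": body.decode("utf-8", "replace")}) + "\n")

        # Forward to the real API; drop hop-by-hop headers and accept-encoding
        # so the upstream responds uncompressed.
        skip = {"host", "content-length", "accept-encoding", "connection"}
        headers = {k: v for k, v in self.headers.items() if k.lower() not in skip}
        req = urllib.request.Request(UPSTREAM + self.path, data=body,
                                     headers=headers, method="POST")
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:  # pass API errors straight through
            resp = err
        data = resp.read()  # note: buffers streaming (SSE) responses entirely

        self.send_response(resp.status if hasattr(resp, "status") else resp.code)
        for k, v in resp.headers.items():
            if k.lower() not in {"transfer-encoding", "content-length", "connection"}:
                self.send_header(k, v)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8082), LoggingProxy).serve_forever()
```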
I already forgot about CLAUDE.md; I generate and update it by AI. I prefer to keep design, tasks, and docs folders instead. It is always better to ask it to read some spec docs and the real code first before doing anything.
It seems overall a good set of guidelines. I appreciate some of the observations being backed up by data.
What I find most interesting is how a hierarchical / recursive context construct begins to emerge. The author's note about a "root" claude.md, as well as the opening comments on LLMs being stateless, strike a chord with me. I think soon we will start seeing stateful LLMs, via clever manipulation of scope and context. Something akin to memory, as we humans perceive it.
Here is my take on writing a good claude.md.
I had very good results with my 3-file approach, which has also been inspired by the great blog posts that Human Layer publishes from time to time:
https://github.com/marcuspuchalla/claude-project-management
Has anyone had success getting Claude to write its own Claude.md file? It should be able to deduce rules by looking at the code, documentation, and PR comments.
The main failure state I find is that Claude wants to write an incredibly verbose Claude.md, but if I instruct it "one sentence per topic, be concise" it usually does a good job.
That said, a lot of what it can deduce by looking at the code is exactly what you shouldn't include, since it will usually deduce that stuff just by interacting with the code base anyway. Claude doesn't seem good at leaving that stuff out.
An example of both overly-verbose and unnecessary:
### 1. Identify the Working Directory
When a user asks you to work on something:
1. *Check which project* they're referring to
2. *Change to that directory* explicitly if needed
3. *Stay in that directory* for file operations
```bash
# Example: Working on ProjectAlpha
cd /home/user/code/ProjectAlpha
```
(The one sentence version is "Each project has a subfolder; use pwd to make sure you're in the right directory", and the ideal version is probably just letting it occasionally spend 60 seconds confused, until it remembers pwd exists)
If you have any substantial codebase, it will write a massive file unless you explicitly tell it not to. It also will try and make updates, including garbage like historical or transitional changes, project status, etc...
I think most people who use Claude regularly have probably come to the same conclusions as the article. A few bits of high-level info, some behavior stuff, and pointers to actual docs. Load docs as-needed, either by prompt or by skill. Work through lists and constantly update status so you can clear context and pick up where you left off. Any other approach eats too much context.
If you have a complex feature that would require ingesting too many large docs, you can ask Claude to determine exactly what it needs to build the appropriate context for that feature and save that to a context doc that you load at the beginning of each session.
I copy/pasted it into my codebase to see if it’s any good and now Claude is refusing to do any work? I asked Copilot to investigate why Claude is not working but it too is not working. Do you know what happened?
Honestly I’d rather google get their gemini tool in better shape. I know for a fact it doesn’t ignore instructions like Claude code does but it is horrible at editing files.
What's the actual completion rate for Advent of Code? I'd bet the majority of participants drop off before day 25, even among those aiming to complete it.
Is this intentional? Is AoC designed as an elite challenge, or is the journey more important than finishing?
I think this could work really well for infrastructure/ops style work where the LLM will not be able to grasp the full context of say the network from just a few files that you have open.
But as others are saying this is just basic documentation that should be done anyway.
Ha, I just tell Claude to write it. My results have been generally fine, but I only use Claude on a simple codebase that is well documented already. Maybe I will hand-edit it to see if I can see any improvements.
> Regardless of which model you're using, you may notice that Claude frequently ignores your CLAUDE.md file's contents.
This is news to me. And at the same time it isn’t. Without knowledge of how the models actually work, most prompting is guesswork at best. You have no control over models via prompts.
I've been very satisfied with creating a short AGENTS.md file with the project basics, and then also including references to where to find more information / context, like a /context folder that has markdown files such as app-description.md.
I was expecting the traditional AI-written slop about AI, but this is actually really good. In particular, the "As instruction count increases, instruction-following quality decreases uniformly" section and associated graph is truly fantastic! To my mind, the ability to follow long lists of rules is one of the most obvious ways that virtually all AI models fail today. That's why I think that graph is so useful -- I've never seen someone go and systematically measure it before!
I would love to see it extended to show Codex, which to my mind is by far the best at rule-following. (I'd also be curious to see how Gemini 3 performs.)
Funny how this is exactly the documentation you'd need to make it easy for a human to work with the codebase. Perhaps this'll be the greatest thing about LLMs -- they force people to write developer guides for their code. Of course, people are going to ask an LLM to write the CLAUDE.md and then it'll just be more slop...
It's not exactly the doc you'd need for a human. There could be overlap, but each side may also have unique requirements that aren't necessarily suitable for the other. E.g. a doc for a human may have considerably more information than you'd want to give to the agent, or, you may want to define agent behavior for workflows that don't apply to a human.
Also, while it may be hip to call any LLM output slop, that really isn't the case. Look at what a poor history we have of developer documentation. LLMs may not be great at everything, but they're actually quite capable when it comes to technical documentation. Even a 1-shot attempt by LLM is often way better than many devs who either can't write very well, or just can't be bothered to.
I've been a customer since Sonnet 3.5. It's getting to the point where Opus 4.5 usually does better than whatever your instructions say in claude.md, just by reading your code and having a general sense of what your preferences are.
I used to instruct about coding style (prefer functions, avoid classes, use structs for complex params and returns, avoid member functions unless needed by shared state, avoid superfluous comments, avoid silly utf8 glyphs, AoS vs SoA, dry, etc)
I removed all my instructions and it basically never violates those points.
It's always funny, I think the opposite. I use a massive CLAUDE.md file, but it's targeted towards very specific details of what to do, and what not to do.
I have a full system of agents, hooks, skills, and commands, and it all works for me quite well.
I believe in massive context, but targeted context. It has to be valuable, and important.
My agents are large. My skills are large. Etc etc.
I would recommend using it, yeah. You have limited context and it will be compacted/summarized occasionally. The compaction/summary will lose some information and it is easy for it to forget certain instructions you gave it. Afaik claude.md will be loaded into the context on every compaction which allows you to use it for instructions that should always be included in the context.
"Here's how to use the slop machine better" is such a ridiculous pretense for a blog or article. You simply write a sentence and it approximates it. That is hardly worth any literature being written as it is so self obvious.
This is an excellent point - LLMs are autoregressive next-token predictors, and output token quality is a function of input token quality
Consider that if the only code you get out of the autoregressive token prediction machine is slop, that this indicates more about the quality of your code than the quality of the autoregressive token prediction machine
I copied this post and gave it to claude code, and had it self-modify CLAUDE.md. It.. worked really well.
> Claude often ignores CLAUDE.md
> The more information you have in the file that's not universally applicable to the tasks you have it working on, the more likely it is that Claude will ignore your instructions in the file
Claude.md files can get pretty long, and many times Claude Code just stops following a lot of the directions specified in the file
A friend of mine tells Claude to always address him as “Mr Tinkleberry”, he says he can tell Claude is not paying attention to the instructions on Claude.md, when Claude stops calling him “Mr Tinkleberry” consistently
That’s hilarious and a great way to test this.
What I’m surprised about is that OP didn’t mention having multiple CLAUDE.md files in each directory, specifically describing the current context / files in there. Eg if you have some database layer and want to document some critical things about that, put it in “src/persistence/CLAUDE.md” instead of the main one.
Claude pulls in those files automatically whenever it tries to read a file in that directory.
I find that to be a very effective technique to leverage CLAUDE.md files and be able to put a lot of content in them, but still keep them focused and avoid context bloat.
Ummm… sounds like that directory should have a readme. And Claude should read readme files.
READMEs are written for people, CLAUDE.mds are written for coding assistants. I don’t write “CRITICAL (PRIORITY 0):” in READMEs.
The benefit of CLAUDE.md files is that they’re pulled in automatically, eg if Claude wants to read “tests/foo_test.py” it will automatically pull in “tests/CLAUDE.md” (if it exists).
If AI is supposed to deliver on this magical no-lift ease of use task flexibility that everyone likes to talk about I think it should be able to work with a README instead of clogging up ALL of my directories with yet another fucking config file.
Also this isn’t portable to other potential AI tools. Do I need 3+ md files in every directory?
> Do I need 3+ md files in every directory?
Don’t worry, as of about 6 weeks ago when they changed the system prompt Claude will make sure every folder has way more than 3 .md files seen as it often writes 2 or more per task so if you don’t clean them up…
Strange. I haven’t experienced this a single time and I use it almost all day everyday.
That is strange because it's been going on since sonnet 4.5 release.
Is your logic that unless something is perfect it should not be used even though it is delivering massive productivity gains?
> it is delivering massive productivity gains
[citation needed]
Every article I can find about this is citing the valuation of the S&P500 as evidence of the productivity gains, and that feels very circular
It’s not delivering on magical stuff. Getting real productivity improvements out of this requires engineering and planning and it needs to be approached as such.
One of the big mistakes I think is that all these tools are over-promising on the “magic” part of it.
It’s not. You need to really learn how to use all these tools effectively. This is not done in days or weeks even, it takes months in the same way becoming proficient in eMacs or vim or a programming language is.
Once you’ve done that, though, it can absolutely enhance productivity. Not 10x, but definitely in the area of 2x. Especially for projects / domains you’re uncomfortable with.
And of course the most important thing is that you need to enjoy all this stuff as well, which I happen to do. I can totally understand the resistance as it’s a shitload of stuff you need to learn, and it may not even be relevant anymore next year.
While I believe you're probably right that getting any productivity gains from these tools requires an investment, I think calling the process "engineering" is really stretching the meaning of the word. It's really closer to ritual magic than any solid engineering practices at this point. People have guesses and practices that may or may not actually work for them (since measuring productivity increases is difficult if not impossible), and they teach others their magic formulas for controlling the demon.
Most countries don’t have a notion of a formally licensed software engineer, anyway. Arguing what is and is not engineering is not useful.
>> [..] and it may not even be relevant anymore next year.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Yeah I feel like on average I still spend a similar amount of time developing but drastically less time fixing obscure bugs, because once it codes the feature and I describe the bugs it fixed them, the rest of my times spent testing and reviewing code.
Learning how to equip a local LLM with tools it can use to extend its capabilities has been a lot of fun for me and is a great educational experience for anyone who is interested. Just another tool for the toolchest.
> “CRITICAL (PRIORITY 0):”
There's no need for this level of performative ridiculousness with AGENTS.md (Codex) directives, FYI.
Is this documented anywhere? This is the first I have ever heard of it.
Here: https://www.anthropic.com/engineering/claude-code-best-pract...
claude.md seems to be important enough to be their very first point in that document.
Naw man, it's the first point because in April Claude Code didn't really have anything else that somewhat worked.
I tried to use that effectively. I even started a new greenfield project just to make sure to test it under ideal circumstances, and while it somewhat worked, it was always super lackluster; it was way more effective to explicitly add the context manually via a prepared md you just reference in the prompt.
I'd tell anyone to go for skills first before littering your project with these config files everywhere
I often can't tell the difference between my Readme and Claude files to the point that I cannibalise the Claude file for the Readme.
It's the difference between instructions for a user and instructions for a developer, but in coding projects that's not much different.
You could make a hook in Claude to re-inject claude.md. For example, make it say "Mr Tinkleberry" in every response, and failing to do so re-injects the instructions.
We are back to color-sorted M&Ms bowls.
I wonder if there are any benefits, side-effects or downsides of everyone using the same fake name for Claude to call them.
If a lot of people always put "call me Mr. Tinkleberry" in the file, will it start calling people Mr. Tinkleberry even when it loses the context, because so many people seem to want to be called Mr. Tinkleberry?
Then you switch to another name.
I have a /bootstrap command that I run which instructs Claude Code to read all system and project CLAUDE.md files, skills and commands.
Helps me quickly whip it back in line.
Mind sharing it? (As long as it doesn’t involve anything private.)
Isn’t that what every new session does?
That also clears the context; a command would just append to the context.
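For what it's worth, custom slash commands are just markdown files under .claude/commands/, so a hypothetical /bootstrap along these lines (paths and wording made up) might be all it takes:

```markdown
<!-- .claude/commands/bootstrap.md (hypothetical) -->
Before doing anything else:

1. Re-read ~/.claude/CLAUDE.md and ./CLAUDE.md, plus any CLAUDE.md in directories touched this session.
2. Re-read the descriptions of available skills and commands under .claude/.
3. Reply with a short bullet list of the rules you are now operating under, then wait for my next instruction.
```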
That's smart, but I worry that it only works partially; you'll be filling up the context window with conversation turns where the LLM consistently addresses its user as "Mr. Tinkleberry", thus reinforcing that specific behavior encoded by CLAUDE.md. I'm not convinced that this way of addressing the user implies that it keeps paying attention to the rest of the file.
I've found that Codex is much better at instruction-following like that, almost to a fault (for example, when I tell it to "always use TDD", it will try to use TDD even when just fixing already-valid-just-needing-expectation-updates tests!)
It baffles me how people can be happy working like this. "I wrap the hammer in paper so if the paper breaks I know the hammer has turned into a saw."
If you have any experience in 3D modeling, I feel it's quite closer to 3D Unwrapping than software development.
You got a bitmap atlas ("context") where you have to cram as much information as possible without losing detail, and then you need to massage both your texture and the structure of your model so that your engine doesn't go mental when trying to map your information from 2D to 3D space.
Likewise, both operations are rarely blemish-free and your ability resides in being able to contain the intrinsic stochastic nature of the tool.
You could think of it as art or creativity.
> It Is Difficult to Get a Man to Understand Something When His Salary Depends Upon His Not Understanding It
probably by not thinking in ridiculous analogies that don't help
I used to tell it to always start every message with a specific emoji. If the emoji wasn't present, I knew the rules were being ignored.
But it's not reliable enough. It can send the emoji or address you correctly while still ignoring more important rules.
Now I find that it’s best to have a short and tight rules file that references other files where necessary. And to refresh context often. The longer the context window gets, the more likely it is to forget rules and instructions.
The green M&M's trick of AI instructions.
I've used that a couple times, e.g. "Conclude your communications with "Purple fish" at the end"
Claude definitely picks and chooses when purple fish will show up
I tell it to accomplish only half of what it thinks it can, then conclude with a haiku. That seems to help, because 1) I feel like it starts shedding discipline as it starts feeling token pressure, and 2) I feel like it is more likely to complete task n - 1 than it is to complete task n. I have no idea if this is actually true or not, or if I'm hallucinating... all I can say is that this is the impression I get.
The article explains why that's not a very good test however.
Why not? It's relevant for all tasks, and just adds 1 line
I guess I assumed that it's not highly relevant to the task, but I suppose it depends on interpretation. E.g. if someone tells the bus driver to smile while he drives, it's hopefully clear that actually driving the bus is more important than smiling.
Having experimented with similar config, I found that Claude would adhere to the instructions somewhat reliably at the beginning and end of the conversation, but was likely to ignore during the middle where the real work is being done. Recent versions also seem to be more context-aware, and tend to start rushing to wrap up as the context is nearing compaction. These behaviors seem to support my assumption, but I have no real proof.
It will also let the LLM process even more tokens, thus decreasing its accuracy.
> A friend of mine tells Claude to always address him as “Mr Tinkleberry”, he says he can tell Claude is not paying attention to the instructions on Claude.md, when Claude stops calling him “Mr Tinkleberry” consistently
this is a totally normal thing that everyone does, that no one should view as a signal of a psychotic break from reality...
is your friend in the room with us right now?
I doubt I'll ever understand the lengths AI enjoyers will go to just to avoid any amount of independent thought...
I suspect you’re misjudging the friend here. This sounds more like the famous “no brown m&ms” clause in the Van Halen performance contract. As ridiculous as the request is, it being followed provides strong evidence that the rest (and more meaningful) of the requests are.
Sounds like the friend understands quite well how LLMs actually work and has found a clever way to be signaled when it’s starting to go off the rails.
It's also a common tactic for filtering inbound email.
Mention that people may optionally include some word like 'orange' in the subject line to tell you they've come via some place like your blog or whatever it may be, and have read at least carefully enough to notice this.
Of course ironically that trick's probably trivially broken now because of use of LLMs in spam. But the point stands, it's an old trick.
Apart from the fact that not even every human would read this and add it to the subject, this would still work.
I doubt there is any spam machine out there that quickly tries to find people's personal blog before sending them viagra mail.
If you are being targeted personally, then of course all bets are off, but that would’ve been the case with or without the subject-line-trick
Could try asking for a seahorse emoji in addition…
> I suspect you’re misjudging the friend here. This sounds more like the famous “no brown m&ms” clause in the Van Halen performance contract. As ridiculous as the request is, it being followed provides strong evidence that the rest (and more meaningful) of the requests are.
I'd argue it's more like you've bought so much into the idea that this is reasonable that you're also willing to go to extreme lengths to retcon and pretend this is sane.
Imagine two different worlds, one where the tools that engineers use, have a clear, and reasonable way to detect and determine if the generative subsystem is still on the rails provided by the controller.
And another world where the interface is completely devoid of any sort of basic introspection interface, and because it's a problematic mess, all the way down, everyone invents some asinine way that they believe provides some sort of signal as to whether or not the random noise generator has gone off the rails.
> Sounds like the friend understands quite well how LLMs actually work and has found a clever way to be signaled when it’s starting to go off the rails.
My point is that while it's a cute hack, if you step back and compare it objectively to what good engineering would look like, it's wild that so many people are willing to accept this interface as "functional" because it means they don't have to do the thinking required to emit the output the AI is able to, via the specific randomness function used.
Imagine these two worlds actually do exist; and instead of using the real interface that provides a clear bool answer to "the generative system has gone off the rails", they *want* to be called Mr Tinkleberry.
Which world do you think this example lives in? You could convince me Mr Tinkleberry is a cute example of the latter, obviously... but it'd take effort to convince me that this reality is half reasonable, or that it's reasonable that people who would want to call themselves engineers should feel proud to be a part of this one.
Before you try to strawman my argument, this isn't a gatekeeping argument. It's only a critical take on the interface options we have to understand something that might as well be magic, because that serves the snakeoil sales much better.
> > Is the magic token machine working?
> Fuck I have no idea dude, ask it to call you a funny name, if it forgets the funny name it's probably broken, and you need to reset it
Yes, I enjoy working with these people and living in this world.
It is kind of wild that not that long ago the general sentiment in software engineering (at least as observed on boards like this one) seemed to be about valuing systems that were understandable, introspectable, with tight feedback loops, within which we could compose layers of abstractions in meaningful and predictable ways (see for example the hugely popular - at the time - works of Chris Granger, Bret Victor, etc).
And now we've made a complete 180 and people are getting excited about proprietary black boxes and "vibe engineering" where you have to pretend like the computer is some amnesic schizophrenic being that you have to coerce into maybe doing your work for you, but you're never really sure whether it's working or not because who wants to read 8000 line code diffs every time you ask them to change something. And never mind if your feedback loops are multiple minutes long because you're waiting on some agent to execute some complex network+GPU bound workflow.
You don’t think people are trying very hard to understand LLMs? We recognize the value of interpretability. It is just not an easy task.
It’s not the first time in human history that our ability to create things has exceeded our capacity to understand.
> You don’t think people are trying very hard to understand LLMs? We recognize the value of interpretability. It is just not an easy task.
I think you're arguing against a tangential position to both me and the person this directly replies to. It can be hard to use and understand something, but if you have a magic box and you can't tell whether it's working, it doesn't belong anywhere near the systems that other humans use. The people that use the code you're about to commit to whatever repo you're generating code for all deserve better than to be part of your unethical science experiment.
> It’s not the first time in human history that our ability to create things has exceeded our capacity to understand.
I don't agree this is a correct interpretation of the current state of generative transformer based AI. But even if you wanted to try to convince me; my point would still be, this belongs in a research lab, not anywhere near prod. And that wouldn't be a controversial idea in the industry.
> It doesn't belong anywhere near the systems that other humans use
Really, for those of us who actually work in critical systems (emergency services in my case), of course we're not going to start patching the core applications with vibe code.
But yeah, that frankenstein reporting script that half a dozen amateur hackers made a mess of over 20 years instead of refactoring and redesigning? That's prime fodder for this stuff. NOBODY wants to clean that stuff up by hand.
Your comment would be more useful if you could point us to some concrete tooling that’s been built out in the last ~3 years that LLM assisted coding has been around to improve interpretability.
This reads like you either have an idealized view of Real Engineering™, or used to work in a stable, extremely regulated area (e.g. civil engineering). I used to work in aerospace in the past, and we had a lot of silly Mr Tinkleberry canaries. We didn't strictly rely on them because our job was "extremely regulated" to put it mildly, but they did save us some time.
There's a ton of pretty stable engineering subfields that involve a lot more intuition than rigor. A lot of things in EE are like that. Anything novel as well. That's how steam in the 19th century or aeronautics in the early 20th century felt. Or rocketry in the 1950s, for that matter. There's no need to be upset with the fact that some people want to hack explosive stuff together before it becomes a predictable glacier of Real Engineering.
> There's no need to be upset with the fact that some people want to hack explosive stuff together before it becomes a predictable glacier of Real Engineering.
You misunderstand me. I'm not upset that people are playing with explosives. I'm upset that my industry is playing with explosives that all read, "front: face towards users"
And then, more upset that we're all seemingly ok with that.
The driving force of the enshittification of everything may be external, but the degradation clearly comes from engineers first. These broader industry trends only convince me it's not likely to get better anytime soon, and I don't like how everything is user hostile.
Man I hate this kind of HN comment that makes grand sweeping statement like “that’s how it was with steam in the 19th century or rocketry in the 1950s”, because there’s no way to tell whether you’re just pulling these things out of your… to get internet points or actually have insightful parallels to make.
Could you please elaborate with concrete examples on how aeronautics in the 20th century felt like having a fictional friend in a text file for the token predictor?
We're not going to advance the discussion this way. I also hate this kind of HN comment that makes grand sweeping statement like "LLMs are like having a fictional friend in a text file for the token predictor", because there's no way to tell whether you're just pulling these things out of your... to get internet points or actually have insightful parallels to make.
Yes, during the Wright era aeronautics was absolutely dominated by tinkering, before the aerodynamics was figured out. It wouldn't pass the high standard of Real Engineering.
> Yes, during the Wright era aeronautics was absolutely dominated by tinkering, before the aerodynamics was figured out. It wouldn't pass the high standard of Real Engineering.
Remind me: did the Wright brothers start selling tickets to individuals telling them it was completely safe? Was step 2 of their research building a large passenger plane?
I originally wanted to avoid that specific flight analogy, because it felt a bit too reductive. But while we're being reductive, how about medicine too; the first smallpox vaccine was absolutely not well understood... would that origin story pass ethical review today? What do you think the pragmatics would be if the medical profession encouraged that specific kind of behavior?
> It wouldn't pass the high standard of Real Engineering.
I disagree, I think it 100% really is engineering. Engineering at its most basic is tricking physics into doing what you want. There's no more perfect example of that than heavier-than-air flight. But there's a critical difference between engineering research and experimenting on unwitting people. I don't think users need to know how the sausage is made. That applies equally to planes, bridges, medicine, and code. But the professionals absolutely must. It's disappointing watching the industry I'm a part of willingly eschew understanding to avoid a bit of effort. Such a thing is considered malpractice in "real professions".
Ideally neither of you would wring your hands about the flavor or form of the argument, or poke fun at the gamified comment thread. But if you're gonna complain about others not adding positively to the discussion, try to add something to it along with the complaints.
As a matter of fact, commercial passenger service started almost immediately once the tech was out of the fiction phase. The airships were large, highly experimental, barely controllable, hydrogen-filled death traps that were marketed as luxurious and safe. The first airliners also appeared, with big engines and large airframes (WWI disrupted this a bit). Nothing of that was built on solid grounds. The adoption was only constrained by industrial capacity and cost. Most large aircraft were more or less experimental up until the 50's, and aviation in general was unreliable until about the 80's.
I would say that right from the start everyone was pretty well aware about the unreliability of LLM-assisted coding and nobody was experimenting on unwitting people or forcing them to adopt it.
>Engineering at it's most basic is tricking physics into doing what you want.
Very well, then Mr Tinkleberry also passes the bar because it's exactly such a trick. That it irks you as a cheap hack that lacks rigor (which it does) is another matter.
I use agents almost all day and I do way more thinking than I used to, this is why I’m now more productive. There is little thinking required to produce output, typing requires very little thinking. The thinking is all in the planning… If the LLM output is bad in any given file I simply change it, obviously this is much faster than typing every character.
I’m spending more time planning and my planning is more comprehensive and faster than it used to be. I’m spending less time producing output, my output is more plentiful and of equal quality. No generated code goes into my commits without me reviewing it for quality. Where is the problem here?
This could be a very niche standup comedy routine, I approve.
It feels like you’re blaming the AI engineers here, that they built it this way out of ignorance or something. Look into interpretability research. It is a hard problem!
I am blaming the developers who use AI because they're willing to sacrifice intellectual control in trade for something that I find has minimal value.
I agree it's likely to be a complex or intractable problem. But I don't enjoy watching my industry revert down the professionalism scale. Professionals don't choose tools they can't explain. If your solution to figuring out whether your tool is still functional is inventing an amusing name and using that as the heuristic, because you have no better way to determine if it's still working correctly, that feels like it might be a problem, no?
The 'canary in the coal mine' approach (like the Mr. Tinkleberry trick) is silly but pragmatic. Until we have deterministic introspection for LLMs, engineers will always invent weird heuristics to detect drift. It's not elegant engineering, but it's effective survival tactics in a non-deterministic loop.
From the article:
> We recommend keeping task-specific instructions in separate markdown files with self-descriptive names somewhere in your project. Then, in your CLAUDE.md file, you can include a list of these files with a brief description of each, and instruct Claude to decide which (if any) are relevant and to read them before it starts working.
I've been doing this since the early days of agentic coding though I've always personally referred to it as the Table-of-Contents approach to keep the context window relatively streamlined. Here's a snippet of my CLAUDE.md file that demonstrates this approach:
Full CLAUDE.md file for reference: https://gist.github.com/scpedicini/179626cfb022452bb39eff10b...
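For illustration (not the snippet from the gist), a table-of-contents style section in a CLAUDE.md might look something like this, with made-up file names:

```markdown
## Task-specific docs (read only what's relevant to the current task)

- docs/db-migrations.md: adding or altering database tables
- docs/release-process.md: cutting a release or touching CI config
- docs/frontend-style.md: anything under src/ui/
```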
I have also done this, but my results are very hit or miss. Claude rarely actually reads the other documentation files I point it to.
Yeah I don't trust any agent to follow document references consistently. I just manually add the relevant files to context every single time.
Though I know some people who have built an mcp that does exactly this: https://www.usable.dev/
It's basically a chat-bot frontend to your markdown files, with both rag and graph db indexes.
I think the key here is “if X then Y syntax” - this seems to be quite effective at piercing through the “probably ignore this” system message by highlighting WHEN a given instruction is “highly relevant”
What?
Correct me if I'm wrong, but I think the new "skills" are exactly this, but better.
Indeed, the article links to the skill documentation which says:
> Skills are modular capabilities that extend Claude’s functionality through organized folders containing instructions, scripts, and resources.
And
> Extend Claude’s capabilities for your specific workflows
E.g. building your project is definitely a workflow.
It also makes sense to put as much as you can into a skill, as this is an optimized mechanism for Claude Code to retrieve relevant information based on the skill’s frontmatter.
Yeah I think "Skills" are just a more codified folder based approach to this TOC system. The main reason I haven't migrated yet is that the TOC approach lends itself better to the more generic AGENTS.md style - allowing me to swap over to alternative LLMs (such as Gemini) relatively easily.
I don't get the point. Point it at your relevant files, ask it to review, discuss the update, refine its understanding, and then tell it to go.
I have found that more context, comments, and info damage quality on hard problems.
I actually for a long time now have two views for my code.
1. The raw code with no empty space or comments. 2. Code with comments
I never give the second to my LLM. The more context you give, the lower its upper end of quality becomes. This is just a habit I've picked up using LLMs every day, hours a day, since GPT-3.5; it allows me to reach farther into extreme complexity.
I suppose I don't know what most people are using LLMs for, but the higher complexity your work entails, the less noise you should inject into it. It's tempting to add massive amounts of context, but I've routinely found that fails at the higher levels of coding complexity and uniqueness. It was more apparent in earlier models; newer ones will handle tons of context, you just won't be able to get those upper ends of quality.
Compute-to-information ratio is all that matters. Compute is capped.
> I have found that more context comments and info damage quality on hard problems.
There can be diminishing returns, but every time I’ve used Claude Code for a real project I’ve found myself repeating certain things over and over again and interrupting tool usage until I put it in the Claude notes file.
You shouldn’t try to put everything in there all the time, but putting key info in there has been very high ROI for me.
Disclaimer: I’m a casual user, not a hardcore vibe coder. Claude seems much more capable when you follow the happy path of common projects, but gets constantly turned around when you try to use new frameworks and tools and such.
Setting hooks has been super helpful for me, you can reject certain uses of tools (don’t touch my tests for this session) with just simple scripting code.
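As a rough sketch of the "don't touch my tests" idea, a PreToolUse hook command could be a small script like the one below; this assumes the hook receives a JSON payload on stdin with tool_name and tool_input.file_path, and that exiting with code 2 blocks the tool call and feeds stderr back to Claude (worth verifying against the current hooks docs):

```python
#!/usr/bin/env python3
# Sketch of a PreToolUse hook that refuses edits to test files.
# Register it in .claude/settings.json under PreToolUse with a matcher
# like "Edit|Write"; the payload fields and exit-code behavior below are
# assumptions to verify against the current Claude Code hooks docs.
import json
import sys

payload = json.load(sys.stdin)
tool = payload.get("tool_name", "")
path = (payload.get("tool_input") or {}).get("file_path", "")

if tool in ("Edit", "Write") and "/tests/" in path:
    # Exit code 2 is expected to block the tool call and surface stderr to Claude.
    print("Tests are off limits this session; propose changes instead.", file=sys.stderr)
    sys.exit(2)

sys.exit(0)  # allow everything else
```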
A git lint hook has been key. No matter how many times I told it, it lints randomly. Sometimes not at all. Sometimes before running tests (but not after fixing test failures).
Agreed, I don't love the CLAUDE.md that gets autogenerated. It's too wordy for me to understand and for the model to follow consistently.
I like to write my CLAUDE.md directly, with just a couple paragraphs describing the codebase at a high level, and then I add details as I see the model making mistakes.
> 1. The raw code with no empty space or comments. 2. Code with comments
I like the sound of this but what technique do you use to maintain consistency across both views? Do you have a post-modification script which will strip comments and extraneous empty space after code has been modified?
Custom scripts and basic merge logic, but manual work still happens around modifications. It forces me to update stale comments around changes anyhow.
I first "discovered" it because I repeatedly found LLM comments poisoned my code base over time and limited its upper end of ability.
Easy to try: just drop comments around a problem and see the difference. I was previously doing that and then manually updating the original.
Curious if that is the case, how you would put comments back too? Seems like a mess.
As I think more on how this could work, I’d treat the fully commented code as the source of truth (SOT).
1. SOT through a processor to strip comments and extra spaces. Publish to feature branch.
2. Point Claude at feature branch. Prompt for whatever changes you need. This runs against the minimalist feature branch. These changes will be committed with comments and readable spacing for the new code.
3. Verify code changes meet expectations.
4. Diff the changes from minimal version, and merge only that code into SOT.
Repeat.
Just test it, maybe you won't get a boost.
1. Run into a problem you and AI can't solve. 2. Drop all comments 3. Restart debug/design session 4. Solve it and save results 5. Revert code to have comments and put update in
If that still doesn't work: Step 2.5 drop all unrelated code from context
This is exactly right. Attention is all you need. It's all about attention. Attention is finite.
The more data you load into context, the more you dilute attention.
people who criticize LLMs for merely regurgitating statistically related token sequences have very clearly never read a single HN comment
IMO within the documentation .md files the information density should be very high. Higher than trying to shove the entire codebase into context that is for sure.
You definitely don't just push the entire code base. Previous models required you to be meticulous about your input. A function here, a class there.
Even now if I am working on REALLY hard problems I will still manually copy and paste code sections out for discussion and algorithm designs. Depends on complexity.
This is why I still believe OpenAI's o1-pro was the best model I've ever seen. The amount of compute you could throw at a problem was absurd.
Genuinely curious — how did you isolate the effect of comments/context on model performance from all the other variables that change between sessions (prompt phrasing, model variance, etc)? In other words, how did you validate the hypothesis that "turning off the comments" (assuming you mean stripping them temporarily...) resulted in an objectively superior experience?
What did your comparison process look like? It feels intuitively accurate and validates my anecdotal impression but I'd love to hear the rigor behind your conclusions!
I was already in the habit of copy-pasting relevant code sections to maximize reasoning performance, to squeeze performance out of earlier, weaker models on stubborn problems. (Still do this on really nasty ones.)
It's also easy to notice LLMs create garbage comments that get worse over time. I started deleting all comments manually alongside manual snippet selection to get max performance.
Then I started just routinely deleting all comments before big problem-solving sessions. Was doing it enough to build some automation.
Maybe high quality human comments improve ability? Hard to test in a hybrid code base.
> I never give the second to my LLM.
How do you practically achieve this? Honest question. Thanks
Custom scripts.
1. Turn off 2. Code 3. Turn on 4. Commit
I also delete all LLM comments; they 100% poison your codebase.
>> 1. The raw code with no empty space or comments. 2. Code with comments
> 1. Turn off 2. Code 3. Turn on 4. Commit
What does it mean "turn off" / "turn on"?
Do you have a script to strip comments?
Okay, after the comments were stripped, does this become the common base for 3-way merge?
After modification of the code stripped of the comments, do you apply 3-way merge to reconcile the changes and the comments?
This seems like a lot of work. What is the benefit? I mean demonstrable benefit.
How does it compare to instructing through AGENTS.md to ignore all comments?
Telling an AI to ignore comments != no comments; that's pretty fundamental to my point.
>> 1. The raw code with no empty space or comments. 2. Code with comments
> 1. Turn off 2. Code 3. Turn on 4. Commit
So can you describe your "turn off" / "turn on" process in practical terms?
Asking simply because saying "Custom scripts" is similar to saying "magic".
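A naive sketch of what such a "turn off" pass could look like for Python files (it only drops blank lines and full-line # comments; inline comments and docstrings are left alone, and the commented originals stay in git as the source of truth):

```python
#!/usr/bin/env python3
"""Naive 'comments-off view' generator for Python files (sketch only).

Keeps the commented files untouched; writes a *.stripped.py sibling with
blank lines and full-line # comments removed.
"""
import pathlib
import sys

def strip_view(source: str) -> str:
    kept = []
    for line in source.splitlines():
        bare = line.strip()
        if not bare or bare.startswith("#"):
            continue  # drop blank lines and comment-only lines
        kept.append(line)
    return "\n".join(kept) + "\n"

if __name__ == "__main__":
    for name in sys.argv[1:]:
        src = pathlib.Path(name)
        out = src.with_suffix(".stripped.py")
        out.write_text(strip_view(src.read_text()))
        print(f"{src} -> {out}")
```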
The comments are what makes the model understand your code much better.
See it as a human, the comments are there to speed up understanding of the code.
could you share some more intuition as to why you started believing that? are there ANY comments that are useful?
Writing and updating CLAUDE.md or AGENTS.md feels pointless to me. Humans are the real audience for documentation. The code changes too fast, and LLMs are stateless anyway. What’s been working is just letting the LLM explore the relevant part of the code to acquire the context, defining the problem or feature, and asking for a couple of ways to tackle it. All in one short prompt. That usually gets me solid options to pick from and build out. And I always do one session per problem. This is my lazy approach to getting useful help from an LLM.
This is true but sometimes your codebase has unique quirks that you get tired of repeating. "No, Claude, we do it this other way here. Every time."
I use .md to tell the model about my development workflow. Along the lines of "here's how you lint", "do this to re-generate the API", "this is how you run unit tests", "The sister repositories are cloned here and this is what they are for".
One may argue that these should go in a README.md, but these markdowns are meant to be more streamlined for context, and it's not appropriate to put a one-liner in the imperative tone to fix model behavior in a top-level file like the README.md
That kind of repetitive process belongs in a script, rather than baked into markdown prompts. Claude has custom hooks for that.
I agree with you; however, your approach results in much longer LLM development runs, increased token usage, and a whole lot of repetitive iterations.
I’m definitely interested in reducing token usage techniques. But with one session one problem I’ve never hit a context limit yet, especially when the problem is small and clearly defined using divide-and-conquer. Also, agentic models are improving at tool use and should require fewer tokens. I’ll take as many iterations as needed to ensure the code is correct.
Because it's stateless, it's not pointless? Good codebases don't change fast. Stuff gets added, but for the most part, they shouldn't change.
A well-documented codebase lets both developers and agentic models locate relevant code easily. If you treat the model like a teammate, extra docs for LLMs are unnecessary. IMHO. In frontend work, code moves quickly.
There is a far easier way to do this, and one that is perfectly aligned with how these tools work.
It is called documenting your code!
Just write what this file is supposed to do in a clear concise way. It acts as a prompt, it provides much needed context specific to the file and it is used only when necessary.
Another tip is to add README.md files where possible and where it helps. What is this folder for? Nobody knows! Write a README.md file. It is not rocket science.
What people often forget about LLMs is that they are largely trained on public information which means that nothing new needs to be invented.
You don't have to "prompt it just the right way".
What you have to do is to use the same old good best practices.
For the record I do think the AI community tries to unnecessarily reinvent the wheel on crap all the time.
Sure, readme.md is a great place to put content. But there are things I'd put in a readme that I'd never put in a claude.md if we want to squeeze the most out of these models.
Further, claude/agents.md have special quality-of-life mechanics with the coding agent harnesses like e.g. `injecting this file into the context window whenever an agent touches this directory, no matter whether the model wants to read it or not`
> What people often forget about LLMs is that they are largely trained on public information which means that nothing new needs to be invented.
I don't think this is relevant at all - when you're working with coding agents, the more you can finesse and manage every token that goes into your model and how its presented, the better results you can get. And the public data that goes into the models is near useless if you're working in a complex codebase, compared to the results you can get if you invest time into how context is collected and presented to your agent.
> For the record I do think the AI community tries to unnecessarily reinvent the wheel on crap all the time.
On Reddit's LLM subreddits, people are rediscovering the very basics of software project management as massive insights daily, or at the very least weekly.
Who would've guessed that proper planning, accessible and up-to-date documentation, and splitting tasks into manageable, testable chunks produces good code? Amazing!
Then they write a massive blog post or even some MCP monstrosity for it and post it everywhere as a new discovery =)
So how exactly does one "write what this file is supposed to do in a clear concise way" in a way that is quickly comprehensible to AI? The gist of the article is that when your audience changes from "human" to "AI" the manner in which you write documentation changes. The article is fairly high quality, and presents excellent evidence that simply "documenting your code" won't get you as far as the guidelines it provides.
Your comment comes off as if you're dispensing common-sense advice, but I don't think it actually applies here.
Writing documentation for LLMs is strangely pleasing because you have very linear returns for every bit of effort you spend on improving its quality and the feedback loop is very tight. When writing for humans, especially internal documentation, I’ve found that these returns are quickly diminishing or even negative as it’s difficult to know if people even read it or if they didn’t understand it or if it was incomplete.
Well, no. You run pretty fast into the context limit (or attention limit for long-context models), and the model understands pretty well what code does without documentation.
There's also the question of processes: how to format code, what style of catching to use, and how to run the tests, which humans keep in the back of their head after reading it once or twice, but which an LLM, whose knowledge lifespan is session-limited, needs a constant reminder of.
I’m pretty sure Claude would not work well in my code base if I hadn’t meticulously added docstrings, type hints, and module level documentation. Even if you’re stubbing out code for later implementation, it helps to go ahead and document it so that a code assistant will get a hint of what to do next.
This is missing the point. If I want to instruct Claude to never write a database query that doesn't hit a preexisting index, where exactly am I supposed to document that? You can either choose:
1. A centralized location, like a README (congrats, you've just invented CLAUDE.md)
2. You add a docs folder (congrats, you've just done exactly what the author suggests under Progressive Disclosure)
Moreover, you can't just do it all in a README, for the exact reasons that the author lays out under "CLAUDE.md file length & applicability".
CLAUDE.md simply isn't about telling Claude what all the parts of your code are and how they work. You're right, that's what documenting your code is for. But even if you have READMEs everywhere, Claude has no idea where to put code when it starts a new task. If it has to read all your documentation every time it starts a new task, you're needlessly burning tokens. The whole point is to give Claude important information up front so it doesn't have to read all your docs and fill up its context window searching for the right information on every task.
Think of it this way: incredibly well documented code has everything a new engineer needs to get started on a task, yes. But this engineer has amnesia and forgets everything it's learned after every task. Do you want them to have to reonboard from scratch every time? No! You structure your docs in a way so they don't have to start from scratch every time. This is an accommodation: humans don't need this, for the most part, because we don't reonboard to the same codebase over and over. And so yes, you do need to go above and beyond the "same old good best practices".
1. Create a tool that can check if a query hits a preexisting index
2. Either force Claude to use it (hooks; rough sketch below) or suggest it (CLAUDE.md)
3. Profit!
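For step 2, the hook route is roughly a PostToolUse entry in .claude/settings.json that runs your checker after file edits. This is a sketch from memory, so treat the schema as an approximation and check the Claude Code hooks docs; check-query-indexes.sh is a hypothetical script you'd write yourself:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "./scripts/check-query-indexes.sh" }
        ]
      }
    ]
  }
}
```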
As for "where stuff is", for anything more complex I have a tree-style graph in CLAUDE.md that shows the rough categories of where stuff is. Like the handler for letterboxd is in cmd/handlerletterboxd/ and internal modules are in internal/
Now it doesn't need to go in blind but can narrow down searches when I tell it to "add director and writer to the letterboxd handler output".
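As a sketch, that section of the CLAUDE.md can be as small as this; the paths are the ones from the example above, and the annotations are made up for illustration:

```markdown
## Where stuff is

- cmd/                      entrypoints, one directory per handler
  - cmd/handlerletterboxd/  the letterboxd handler
- internal/                 shared internal modules used by the handlers
```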
This CLAUDE.md dance feels like herding cats. Except we’re herding a really good autocorrect encyclopedic parrot, sans intelligence.
Relating to / personifying an LLM as an engineer doesn’t work out.
Maybe the best mental model currently is just “a good way to automate trivial text modifications” plus “encyclopedic ramblings”.
unfair characterization.
think about how this thing is interacting with your codebase: it can read one file at a time, or sections of files.
in this UX, is it ergonomic to go hunting for patterns and conventions? if you have to linearly process every single thing you look at every time you do something, how are you supposed to have “peripheral vision”? if you have amnesia, how do you continue to do good work in a codebase given you’re a skilled engineer?
it is different from you. that is OK. it doesn’t mean it’s stupid. it means it needs different accommodations to perform as well as you do. accommodations IRL exist for a reason: different people work differently and have different strengths and weaknesses. just like humans, you get the most out of them if you meet and work with them from where they’re at.
You put a warning where it is most likely to be seen by a human coder.
Besides, no amount of prompting will prevent this situation.
If it is a concern, then you add a linter or unit tests to prevent it altogether, or make a wrapper around the tricky function with a warning in its docstring.
I don't see how this is any different from how you typically approach making your code more resilient to accidental mistakes.
Documenting for AI exactly like you would document for a human is ignoring how these tools work
But they are right: Claude routinely ignores stuff from CLAUDE.md, even with warning bells etc. You need a linter preventing things. Like drizzle sql` templates: it just loves them.
You can make affordances for agent abilities without deviating from what humans find to be good documentation. Use hyperlinks, organize information, document in layers, use examples, be concise. It's not either/or unless you're being lazy.
Sounds like we should call them tools, not AI!
Agentic AI is LLMs using tools in a loop to achieve a goal.
Needs a better term than "AI", I agree, but it's 99% marketing; the tech will stay the same.
> no amount of prompting will prevent this situation.
Again, missing the point. If you don't prompt for it and you document it in a place where the tool won't look first, the tool simply won't do it. "No amount of prompting" couldn't be more wrong; it works for me and all my coworkers.
> If it is a concern then you put a linter or unit tests to prevent it altogether
Sure, and then it'll always do things its own way, run the tests, and have to correct itself. Needlessly burning tokens. But if you want to pay for it to waste its time and yours, go for it.
> I don't see how this is any different from how you typically approach making your code more resilient to accidental mistakes.
It's not about avoiding mistakes! It's about having it follow the norms of your codebase.
- My codebase at work is slowly transitioning from Mocha to Jest. I can't write a linter to ban new mocha tests, and it would be a pain to keep a list of legacy mocha test suites. The solution is to simply have a bullet point in the CLAUDE.md file that says "don't write new Mocha test suites, only write new test suites in Jest". A more robust solution isn't necessary and doesn't avoid mistakes, it avoids the extra step of telling the LLM to rewrite the tests.
- We have a bunch of terraform modules for convenience when defining new S3 buckets. No amount of documenting the modules will have Claude magically know they exist. You tell it that there are convenience modules and to consider using them.
- Our ORM has findOne that returns one record or null. We have a convenience function getOne that returns a record or throws a NotFoundError to return a 404 error. There's no way to exhaustively detect with a linter that you used findOne and checked the result for null and threw a NotFoundError. And the hassle of maybe catching some instances isn't necessary, because avoiding it is just one line in CLAUDE.md.
It's really not that hard.
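For illustration, a minimal sketch of the getOne-style convention described above; findOne, getOne, and NotFoundError are the hypothetical names from that comment, not any particular ORM's API:

```typescript
// Thrown by getOne(); error-handling middleware would map it to a 404 response.
class NotFoundError extends Error {}

// Minimal shape of the ORM method being wrapped: returns the record or null.
interface Repository<T> {
  findOne(where: Partial<T>): Promise<T | null>;
}

// Convention: handlers call getOne() instead of findOne() plus a null check.
async function getOne<T>(repo: Repository<T>, where: Partial<T>): Promise<T> {
  const record = await repo.findOne(where);
  if (record === null) {
    throw new NotFoundError(`No record matching ${JSON.stringify(where)}`);
  }
  return record;
}
```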
> There's no way to exhaustively detect with a linter that you used findOne and checked the result for null and threw a NotFoundError
Yes there is? Though this is usually better served with a type checker, it’s still totally feasible with a linter too if that’s your bag
> because avoiding it is just one line in CLAUDE.md.
Except no, it isn’t, because these tools still ignore that line sometimes so I still have to check for it myself.
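Which is also an argument for enforcing it with a linter rather than a CLAUDE.md line. One crude option is to ban direct findOne calls outright with ESLint's no-restricted-properties in .eslintrc.json; this is a sketch using the hypothetical names from the comments above, and a custom or type-aware rule could catch the full pattern more precisely:

```json
{
  "rules": {
    "no-restricted-properties": [
      "error",
      {
        "property": "findOne",
        "message": "Use getOne() so missing records throw NotFoundError (mapped to a 404)."
      }
    ]
  }
}
```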
Learned this the hard way. Asked Claude Code to run a database migration. It deleted my production database instead, then immediately apologised and started panicking trying to restore it.
Thankfully Azure keeps deleted SQL databases recoverable, so I got it back in under an hour. But yeah - no amount of CLAUDE.md instructions would have prevented that. It no longer gets prod credentials.
> 1. A centralized location, like a README (congrats, you've just invented CLAUDE.md)
README files are not a new concept, and have been used in software for like 5 decades now, whereas CLAUDE.md files were invented 12 months ago...
I think you’re missing that CLAUDE.md is deterministically injected into the model’s context window
This means that instead of behaving like a file the LLM reads, it effectively lets you customize the model’s prompt
I also didn’t write that you have to “prompt it just the right way”, I think you’re missing the point entirely
Probably a lot of people here disagree with this feeling. But my take is that if setting up all the AI infrastructure and onboarding to my code is going to take this amount of effort, then I might as well code the damn thing myself which is what I'm getting paid to (and enjoy doing anyway)
The effort described in the article is maybe a couple hours of work.
I understand the "enjoy doing anyway" part and it resonates, but not using AI is simply less productive.
Minutes, really. Despite what the article says, you can get 90% of the way there by telling Claude how you want the project documentation structured and just letting it do it. It's up to you if you really want to tune the last 10% manually; I don't. I have been using basically the same system, and when I tell Claude to update docs it doesn't revert to one big CLAUDE.md; it maintains it in a structure like this.
> but not using AI is simply less productive
Some studies show the opposite for experienced devs. And they also show that developers are delusional about said productivity gains: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
If you have a counter-study (for experienced devs, not juniors), I'd be curious to see. My experience also has been that using AI as part of your main way to produce code, is not faster when you factor in everything.
Curious why there hasn't been a rebuttal study to that one yet (or if there is I haven't seen it come up). There must be near infinite funding available to debunk that study right?
That study is garbo and I suspect you didn't even read the abstract. Am I right?
I've heard this mentioned a few times. Here is a summarized version of the abstract:
So what we can gather:
1. 16 people were randomly given tasks to do
2. They knew the codebase they worked on pretty well
3. They said AI would help them work 24% faster (before starting tasks)
4. They said AI made them ~20% faster (after completion of tasks)
5. ML Experts claim that they think programmers will be ~38% faster
6. Economists say ~39% faster.
7. We measured that people were actually 19% slower
This seems to be done on Cursor, with big models, on codebases people know. There are definitely problems with industry-wide statements like this but I feel like the biggest area AI tools help me is if I'm working on something I know nothing about. For example: I am really bad at web development so CSS / HTML is easier to edit through prompts. I don't have trouble believing that I would be slower trying to make an edit to code that I already know how to make.
Maybe they would see the speedups by allowing the engineer to select when to use the AI assistance and when not to.
it doesn't control for skill/experience using models. this looks VERY different at hour 1000 and hour 5000 than at hour 100.
Lazy of me not to check whether I remember correctly, but the dev that got productivity gains was a regular user of Cursor.
It's a couple of hours right now, then another couple of hours "correcting" the AI when it still goes wrong, another couple of hours tweaking the file again, another couple of hours to update when the model changes, another couple of hours when someone writes a new blog post with another method etc.
There's a huge difference between investing time into a deterministic tool like a text editor or programming language and a moving target like "AI".
The difference between programming in Notepad in a language you don't know and using "AI" will be huge. But the difference between being fluent in a language and having a powerful editor/IDE? Minimal at best. I actually think productivity is worse because it tricks you into wasting time via the "just one more roll" (ie. gambling) mentality. Not to mention you're not building that fluency or toolkit for yourself, making you barely more valuable than the "AI" itself.
You say that as if tech hasn't always been a moving target anyway. The skills I spent months learning (a specific language, an IDE) became obsolete with the next job and the next paradigm shift. That's been one of the few consistent themes throughout my career: hours here and there, spread across months and years, just learning whatever was new. Sometimes, like with Linux, it really paid off. Other times, like PHP, it did, and then fizzled out.
--
The other thing is, this need for determinism bewilders me. I mean, I get where it comes from: we want nice, predictable, reliable machines. But how deterministic does it need to be? If today it decides to generate code where the variable is called fileName, and tomorrow it's filePath, as long as it's passing tests, what do I care that it's not totally deterministic and the names of the variables it generates are different? As long as it's consistent with existing code, and it passes tests, what's the importance of it being deterministic to a computer-science level of rigor? It reminds me of the travelling salesman problem, or the knapsack problem. Both NP-hard, but users don't care about that. They just want the computer to tell them something good enough for them to go on about their day. So if a customer comes up to you and offers a pile of money to solve either one of those problems, do I laugh in their face, knowing damn well I won't be the one to prove that NP = P, or do I explain the situation to them and build them software that will do the best it can, with however much compute resources they're willing to pay for?
A lot of the style stuff you can write once and reuse. I started splitting mine into overall and project specific files for this reason
The universal one has stuff I always want (use uv instead of pip, etc.) while the other describes the tech choices for this project
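A sketch of what the universal file can look like, assuming the global ~/.claude/CLAUDE.md location; everything except the uv-over-pip line is made up for illustration:

```markdown
# Universal preferences (~/.claude/CLAUDE.md)

- Use uv instead of pip for Python dependency management.
- Prefer small, focused commits with imperative commit messages.
- Ask before adding new third-party dependencies.
```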
Perhaps. But keep in mind that the setup work is typically mostly delegated to LLMs as well.
It really doesn't take that much effort. Like any tool, people can over-optimise on the setup rather than just use it.
Whether it's setting up AI infrastructure or configuring Emacs/vim/VSCode, the important distinction to make is if the cost has to be paid continually, or if it's a one time/intermittent cost. If I had to configure my shell/git aliases every time I booted my computer, I wouldn't use them, but seeing as how they're saved in config files, they're pretty heavily customized by this point.
Don't use AI if you don't want to, but "it takes too much effort to set up" is an excuse printf debuggers use to avoid setting up a debugger. Which is a whole other debate though.
I fully agree with this POV but for one detail: there is a problem with sunsetting frontier models. As we begin to adopt these tools and build workflows with them, they become pieces of our toolkit. We depend on them. We take them for granted, even. And then the model changes (new checkpoints, maybe alignment gets fiddled with) and all of a sudden prompts no longer yield the same results we expected from them after working on them for quite some time. I think the term for this is "prompt instability". I felt this with Gemini 3 (and some people had a less pronounced but similar experience with Sonnet releases after 3.7), which for certain tasks that 2.5 Pro excelled at is just unusable now. I was already a local model advocate before this, but now I'm a local model zealot. I've stopped using Gemini 3 over this. Last night I used Qwen3 VL on my 4090, and although it was not perfect (sycophancy, overuse of certain cliches... nothing I can't get rid of later with some custom promptsets and a few hours in Heretic), it did a decent enough job of helping me work through my blind spots in the UI/UX for a project that I got what I needed.
If we have to perform tuning on our prompts ("skills", agents.md/claude.md, all of the stuff a coding assistant packs context with) every model release then I see new model releases becoming a liability more than a boon.
I strongly disagree with the author not using /init. It takes a minute to run and Claude provides surprisingly good results.
If you find it works for you, then that’s great! This post is mostly from our learnings from getting it to solve hard problems in complex brownfield codebases where auto generation is almost never sufficient.
/init has evolved since the early day; it's more concise than it used to be.
I’m sure I’m just working like a caveman, but I simply highlight the relevant code, add it to the chat, and talk to these tools as if they were my colleagues and I’m getting pretty good results.
About 12 to 6 months ago this was not the case (with or without .md files); I was getting mainly subpar results, so I’m assuming that the models have improved a lot.
Basically, I found that they don’t make that much of a difference; the model is either good enough or not…
I know (or at least I suppose) that these markdown files could bring some marginal improvements, but at this point, I don’t really care.
I assume this is an unpopular take because I see so many people treat these files as if they were black magic or a silver bullet that 100x's their already 1000x productivity.
> I simply highlight the relevant code, add it to the chat, and talk to these tools
Different use case. I assume the discussion is about having the agent implement whole features or research and fix bugs without much guidance.
Yep it is opinionated for how to get coding agents to solve hard problems in complex brownfield codebases which is what we are focused on at humanlayer :)
Matches my experience also. I bothered only once to set up a proper CLAUDE.md file, and now never do it. Simply referring to the context properly for surgical recommendations and edits works relatively well.
It feels a lot like bikeshedding to me, maybe I’m wrong
How about a list of existing database tables/columns so you don't need to repeat it each time?
I gave it a tool to execute to get that info if required, but it mostly doesn’t need to due to Kysely migration files and the database type definition being enough.
Claude code figures that out at startup every time. Never had issues with it.
You can save some precious context by having it somewhere without it having to figure it out from scratch every time.
Do you not use a model file for your orm?
ORMs are generally a bad idea, so.. hopefully not?
Even without the explicit magic ORMs, with data-mapper-style query builders like Kysely and similar, I still find I need to marshal selected rows into objects to, y'know, do things with them in a lot of cases.
Perhaps a function of GraphQL though.
Sure, but that's not the same thing. For example, whether or not you have to redeclare your entire database schema in a custom ORM language in a different repo.
=== myExperience
I find the Claude.md file mostly useless. It seems to be 50/50 or LESS that Claude even reads/uses this file.
You can easily test this by adding some mandatory instruction into the file. E.g. "Any new method you write must have less than 50 lines or code." Then use Claude for ten minutes and watch it blow through this limit again and again.
I use CC and Codex extensively and I constantly am resetting my context and manually pasting my custom instructions in again and again, because these models DO NOT remember or pay attention to Claude.md or Agents.md etc.
I'm not sure if Claude Code has integrated it in its system prompts or not since it's moving at breakneck speed, but one instruction I like putting on all of my projects is to "Prompt for technical decisions from user when choices are unsure". This would almost always trigger the prompting feature that Claude Code has for me when it's got some uncertainty about the instructions I gave it, giving me options or alternatives on how to approach the problem when planning or executing.
This way, it's got more of a chance of generating something that I wanted, rather than running off on its own.
I have found enabling the codebase itself to be the “Claude.md” to be most effective. In other words, set up effective automated checks for linting, type checking, unit tests etc and tell Claude to always run these before completing a task. If the agent keeps doing something you don’t like, then a linting update or an additional test often is more effective than trying to tinker with the Claude.md file. Also, ensure docs on the codebase are up to date and tell Claude to read relevant parts when working on a task and of course update the docs for each new task. YMMV but this has worked for me.
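With that approach the CLAUDE.md can shrink to little more than a list of checks to run; a sketch with hypothetical npm script names:

```markdown
## Before considering a task done

Run all of these and fix anything they report:

- `npm run lint`
- `npm run typecheck`
- `npm test`
```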
> Also, ensure docs on the codebase are up to date and tell Claude to read relevant parts when working on a task
Yeah, if you do this every time it works fine. If you add what you tell it every time to CLAUDE.md, it also works fine, but you don’t have to tell it any more ;)
> Claude.md
It’s case-sensitive btw: CLAUDE.md. That might explain your mixed results with it.
PSA: Claude can also use .github/copilot-instructions.md
If you're using VSCode, that is automatically added to context (and I think in Zed that happens as well, although I can't verify right now).
That's a good write-up, very useful to know. I'm sort of on the outside of all this; I've only sort of dabbled, and now use Copilot quite a lot with Claude. What's being said here reminds me a lot of CPU registers: if you think about the limited space in CPU registers, it's astounding how much information processing we're actually able to do, and we need higher layers of systems and operating systems to help manage all of it. So it feels like a lot of what's being said here will inevitably end up being an automated system, or a compiler, or effectively an operating system. Even something basic like a paging system would make a lot of difference.
I've gotten quite a bit of utility out of my current setup[0]:
Some explicit things I found helpful: Have the agent address you as something specific! This way you know if the agent is paying attention to your detailed instructions.
Rationality, as in the stuff practiced on early Less Wrong, gives a great language for constraining the agent, and since it's read The Sequences and everything else, you can include pointers; the more you do, the more it will nudge it into that mode of thought.
The explicit "This is what I'm doing, this is what I expect" pattern has been hugely useful for both me monitoring it/coming back to see what it did, and it itself. It makes it more likely to recover when it goes down a bad path.
The system reminder this article mentions is definitely there but I have not noticed it messing much with adherence. I wish there were some sort of power user mode to turn it off though!
Also, this is probably too long! But I have been experimenting and iterating for a while, and this is what is working best currently. Not that I've been able to hold any other part constant -- Opus 4.5 really is remarkable.
[0]: https://gist.github.com/ctoth/d8e629209ff1d9748185b9830fa4e7...
I've recently started using a similar approach for my own projects. Providing a high-level architecture overview in a single markdown file really helps the LLM understand the 'why' behind the code, not just the 'how'. Does anyone have a specific structure or template for Claude.md that works best for frontend-heavy projects (like React/Vite)? I find that's where the context window often gets cluttered.
I have Claude itself write CLAUDE.md. Once it is informed of its context (e.g., "README.md is for users, CLAUDE.md is for you") you can say things like, "update readme and claudemd" and it will do it. I find this especially useful for prompts like, "update claudemd to make absolutely certain that you check the API docs every single time before making assumptions about its behavior" — I don't need to know what magick spell will make that happen, just that it does happen.
Do you have any proof that AI written instructions are better than human ones? I don't see why an AI would have an innate understanding on how best to prompt itself.
Having been through cycles of manual writing with '#' and having it do it itself, it seems to have been a push on efficacy while spending less effort and getting less frustrated. Hard to quantify except to say that I've had great results with it. I appreciate the spirit of OP's, "CLAUDE.md is the highest leverage point of the harness, so avoid auto-generating it" but you can always ask Claude to tighten it up itself too.
Generally speaking it has a lot of information from things like OP's blog post on how best to structure the file and prompt itself and you can also (from within Claude Code) ask it to look at posts or Anthropic prompting best practices and adopt those to your own file.
This will start to break down after a while unless you have a small project, for reasons being described in the article.
I find writing a good CLAUDE.md is done by running /init, and having the LLM write it. If you need more controls on how it should work, I would highly recommend you implement it in an unavoidable way via hooks and not in a handwritten note to your LLM.
Even better: learn to code yourself.
That paper the article references is old at this point. No GPT 5.1, no Gemini 3, both of which were game changers. I'd love to see their instruction-following graphs.
Same!
None of this should be necessary if these tools did what they say on the tin, and most of this advice will probably age like milk.
Write readmes for humans, not LLMs. That's where the ball is going.
Hi, post author here :)
Yes README.md should still be written for humans and isn’t going away anytime soon.
CLAUDE.md is a convention used by claude code, and AGENTS.md is used by other coding agents. Both are intended to be supplemental to the README and are deterministically injected into the agent’s context.
It’s a configuration point for the harness, it’s not intended to replace the README.
Some of the advice in here will undoubtedly age poorly as harnesses change and models improve, but some of the generic principles will stay the same - e.g. that you shouldn’t use an LLM to do a linter & formatter’s job, or that LLMs are stateless and need to be onboarded into the codebase, and having some deterministically-injected instructions to achieve that is useful instead of relying on the agent to non-deterministically derive all that info by reading config and package files
The post isn’t really intended to be super forward-looking as much as “here’s how to use this coding agent harness configuration point as best as we know how to right now”
> you shouldn’t use an LLM to do a linter & formatter’s job,
Why is that good advice? If that thing is eventually supposed to do the most tricky coding tasks, and already a year ago could have won a medal at the informatics olympics, then why wouldn't it eventually be able to tell if I'm using 2 or 4 spaces and format my code accordingly? Either it's going to change the world, then this is a trivial task, or it's all vaporware, then what are we even discussing..
> or that LLMs are stateless and need to be onboarded into the codebase
What? Why would that be a reasonable assumption/prediction for even near term agent capabilities? Providing it with some kind of local memory to dump its learned-so-far state of the world shouldn't be too hard. Isn't it supposed to already be treated like a junior dev? All junior devs I'm working with remember what I told them 2 weeks ago. Surely a coding agent can eventually support that too.
This whole CLAUDE.md thing seems a temporary kludge until such basic features are sorted out, and I'm seriously surprised how much time folks are spending to make that early broken state less painful to work with. All that precious knowledge y'all are building will be worthless a year or two from now.
> Why is that good advice? If that thing is eventually supposed to do the most tricky coding tasks, and already a year ago could have won a medal at the informatics olympics, then why wouldn't it eventually be able to tell if I'm using 2 or 4 spaces and format my code accordingly? Either it's going to change the world, then this is a trivial task, or it's all vaporware, then what are we even discussing..
This is the exact reason for the advice: The LLM already is able to follow coding conventions by just looking at the surrounding code which was already included in the context. So by adding your coding conventions to the claude.md, you are just using more context for no gain.
And another reason to not use an agent for linting/formatting(i.e. prompting to "format this code for me") is that dedicated linters/formatters are faster and only take maybe a single cent of electricity to run whereas using an LLM to do that job will cost multiple dollars if not more.
> Then why wouldn't it eventually be able to tell if I'm using 2 or 4 spaces and format my code accordingly?
It's not that an agent doesn't know if you're using 2 or 4 spaces in your code; it comes down to:
- there are many ways to ensure your code is formatted correctly; that's what .editorconfig [1] is for (a minimal example is below).
- in a halfway serious project, incorrectly formatted code shouldn't reach the LLM in the first place
- tokens are relatively cheap but they're not free on a paid plan; why spend tokens on something linters and formatters can do deterministically and for free?
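For reference, a minimal .editorconfig that pins indentation; the values here are just an example:

```ini
# .editorconfig: editors and formatters pick this up automatically
root = true

[*]
indent_style = space
indent_size = 4
insert_final_newline = true
trim_trailing_whitespace = true
```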
If you wanted Claude Code to handle linting automatically, you're better off taking that out of CLAUDE.md and creating a Skill [2].
> What? Why would that be a reasonable assumption/prediction for even near-term agent capabilities? Providing it with some kind of local memory to dump its learned-so-far state of the world shouldn't be too hard. Isn't it supposed to already be treated like a junior dev? All junior devs I'm working with remember what I told them 2 weeks ago. Surely a coding agent can eventually support that too.
It wasn't mentioned in the article, but Claude Code, for example, does save each chat session by default. You can come back to a project and type `claude --resume` and you'll get a list of past Claude Code sessions that you can pick up from where you left off.
[1]: https://editorconfig.org
[2]: https://code.claude.com/docs/en/skills
> All junior devs I'm working with remember what I told them 2 weeks ago
That’s why they’re junior
The stateless nature of Claude Code is what annoys me so much. Like, it has to spend so much time doing repetitious bootstraps. And there's how much it “picks up and propagates” random shit it finds in some document it wrote. It will echo back something it wrote that “stood out”, and I’ll forget where it got that and ask “find where you found that info so we can remove it.” And it will do so, but somehow mysteriously pick it up again, and it will be because of some git commit message or something. It’s like a tune stuck in its head, only it’s sticky for LLMs, not humans.
And that describes the issues I had with the “automatic memories” features that things like ChatGPT had. Turns out it is an awful judge of things to remember. Like, it would make memories like “cruffle is trying to make pepper soup with chicken stock”! Which it would then parrot back to me at some point 4 months later and I’d be like “WTF, I figured it out”. The “# remember this” is much more powerful because I know how sticky this stuff gets, and I’d rather have it over-index on my own forceful memories than random shit it decided.
I dunno. All I’m saying is you are right. The future is in having these things do a better job of remembering. And I don’t know if LLMs are the right tool for that. Keyword search isn’t either though. And vector search might not be either—I think it suffers from the same kinds of “catchy tune attack” an LLM might.
Somebody will figure it out somehow.
I think this is an overall good approach and I've gotten alright results with a similar approach. I still think that this CLAUDE.md experience is too magical and that Anthropic should really focus on it.
Actually having official guidelines in their docs would be a good entrypoint, even though I guess we have this, which is the closest thing to anything official for now: https://www.claude.com/blog/using-claude-md-files
One interesting thing I also noticed and used recently is that Claude Code ships with a @agent-claude-code-guide. I've used it to review and update my dev workflow / CLAUDE.md file but I've got mixed feelings on the discussion with the subagent.
The advice here seems to assume a single .md file with instructions for the whole project, but the AGENTS.md methodology as supported by agents like github copilot is to break out more specific AGENTS.md files in the subdirectories in your code base. I wonder how and if the tips shared change assuming a flow with a bunch of focused AGENTS.md files throughout the code.
Hi, post author here :)
I didn’t dive into that because in a lot of cases it’s not necessary and I wanted to keep the post short, but for large monorepos it’s a good idea
Ah, never knew about this injection…
<system-reminder> IMPORTANT: this context may or may not be relevant to your tasks. You should not respond to this context unless it is highly relevant to your task. </system-reminder>
Perhaps a small proxy between Claude code and the API to enforce following CLAUDE.md may improve things… I may try this
Interesting selection of models for the "instruction count vs. accuracy" plot. Curious when that was done and why they chose those models. How well does ChatGPT 5/5.1 (and codex/mini/nano variants), Gemini 3, Claude Haiku/Sonnet/Opus 4.5, recent grok models, Kimi 2 Thinking etc (this generation of models) do?
Guessing they included some smaller models just to show how they dump accuracy at smaller context sizes
Sure - I was more commenting that they are all > 6 months old, which sounds silly, but things have been changing fast, and instruction following is definitely an area that has been developing a lot recently. I would be surprised if accuracy drops off that hard still.
I imagine it’s highly-correlated to parameter count, but the research is a few months old and frontier model architecture is pretty opaque so hard to draw too too many conclusions about newer models that aren’t in the study besides what I wrote in the post
"You can investigate this yourself by putting a logging proxy between the claude code CLI and the Anthropic API using ANTHROPIC_BASE_URL" I'd be eager to read a tutorial about that I never know which tool to favour for doing that when you're not a system or network expert.
Hi, post author here
We used cloudflare’s AI gateway which is pretty simple. Set one up, get the proxy URL and set it through the env var, very plug-and-play
Smart, thanks for the tip
Have you considered just asking claude? I'd wager you'd get up and running in <10 minutes.
AI is good for discovery but not validation, I wanted experienced human feedback here
agree - i've had claude one-shot this for me at least 10 times at this point cause i'm too lazy to lug whatever code around. literally made a new one this morning
Just install mitmproxy. Takes like 5 mins to figure out. 2 with Claude.
I’m on my phone, else I’d post commands.
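For anyone who wants them, the rough shape of the commands; mitmproxy flags from memory, so double-check them against the mitmproxy docs, and the port is arbitrary:

```bash
# Run mitmproxy as a reverse proxy in front of the Anthropic API
mitmproxy --mode reverse:https://api.anthropic.com --listen-port 8080

# In another shell, point Claude Code at the proxy and use it as normal;
# requests (including the injected CLAUDE.md contents) show up in mitmproxy
export ANTHROPIC_BASE_URL=http://localhost:8080
claude
```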
I've already forgotten about CLAUDE.md; I generate and update it with AI. I prefer to keep design, tasks, and docs folders instead. It is always better to ask it to read some spec docs and the real code first before doing anything.
A good Claude.md only needs one line:
Read your instructions from Agents.md
It seems overall a good set of guidelines. I appreciate some of the observations being backed up by data.
What I find most interesting is how a hierarchical / recursive context construct begins to emerge. The authors' note about a "root" claude.md, as well as the opening comments on LLMs being stateless, rings like a bell to me. I think soon we will start seeing stateful LLMs, via clever manipulation of scope and context. Something akin to memory, as we humans perceive it.
Here is my take on writing a good claude.md. I had very good results with my 3-file approach, which has also been inspired by the great blog posts that Human Layer publishes from time to time: https://github.com/marcuspuchalla/claude-project-management
> Claude code injects the following system reminder…
OMG this finally makes sense.
Is there any way to turn off this behavior?
Or better yet is there a way to filter the context that is being sent?
Has anyone had success getting Claude to write its own Claude.md file? It should be able to deduce rules by looking at the code, documentation, and PR comments.
The main failure state I find is that Claude wants to write an incredibly verbose Claude.md, but if I instruct it "one sentence per topic, be concise" it usually does a good job.
That said, a lot of what it can deduce by looking at the code is exactly what you shouldn't include, since it will usually deduce that stuff just by interacting with the code base; Claude doesn't seem good at leaving that out.
An example of both overly-verbose and unnecessary:
### 1. Identify the Working Directory
When a user asks you to work on something:
1. *Check which project* they're referring to
2. *Change to that directory* explicitly if needed
3. *Stay in that directory* for file operations
```bash
# Example: Working on ProjectAlpha
cd /home/user/code/ProjectAlpha
```
(The one sentence version is "Each project has a subfolder; use pwd to make sure you're in the right directory", and the ideal version is probably just letting it occasionally spend 60 seconds confused, until it remembers pwd exists)
If you have any substantial codebase, it will write a massive file unless you explicitly tell it not to. It also will try and make updates, including garbage like historical or transitional changes, project status, etc...
I think most people who use Claude regularly have probably come to the same conclusions as the article. A few bits of high-level info, some behavior stuff, and pointers to actual docs. Load docs as-needed, either by prompt or by skill. Work through lists and constantly update status so you can clear context and pick up where you left off. Any other approach eats too much context.
If you have a complex feature that would require ingesting too many large docs, you can ask Claude to determine exactly what it needs to build the appropriate context for that feature and save that to a context doc that you load at the beginning of each session.
Oh yeah I added a CLAUDE.md to my project the other day: https://github.com/grishka/Smithereen/blob/master/CLAUDE.md
Is it a good one?
Definitely a good one - probably one of the best CLAUDE.md files you can put in any repository if you care about your project at all.
I copy/pasted it into my codebase to see if it’s any good and now Claude is refusing to do any work? I asked Copilot to investigate why Claude is not working but it too is not working. Do you know what happened?
Honestly I’d rather Google get their Gemini tool in better shape. I know for a fact it doesn’t ignore instructions like Claude Code does, but it is horrible at editing files.
What's the actual completion rate for Advent of Code? I'd bet the majority of participants drop off before day 25, even among those aiming to complete it.
Is this intentional? Is AoC designed as an elite challenge, or is the journey more important than finishing?
I think this could work really well for infrastructure/ops style work where the LLM will not be able to grasp the full context of say the network from just a few files that you have open.
But as others are saying this is just basic documentation that should be done anyway.
Ha, I just tell Claude to write it. My results have been generally fine, but I only use Claude on a simple codebase that is well documented already. Maybe I will hand-edit it to see if I can see any improvements.
> Regardless of which model you're using, you may notice that Claude frequently ignores your CLAUDE.md file's contents.
This is news to me. And at the same time it isn’t. Without knowledge of how the models actually work, most of the prompting is a guesstimate at best. You have no control over models via prompts.
I was waiting for someone to build this so that I can chuck it into Claude and tell it how to write a good MD.
I have been using Claude.md to stuff in way too many instructions, so this article was an eye-opener. Btw, any tips for Claude.md when one uses subagents?
I've been very satisfied with creating a short AGENTS.md file with the project basics, and then also including references to where to find more information / context, like a /context folder that has markdown files such as app-description.md.
I was expecting the traditional AI-written slop about AI, but this is actually really good. In particular, the "As instruction count increases, instruction-following quality decreases uniformly" section and associated graph are truly fantastic! To my mind, the ability to follow long lists of rules is one of the most obvious ways that virtually all AI models fail today. That's why I think that graph is so useful: I've never seen someone go and systematically measure it before!
I would love to see it extended to show Codex, which to my mind is by far the best at rule-following. (I'd also be curious to see how Gemini 3 performs.)
I looked when I wrote the post but the paper hasn’t been revisited with newer models :/
Funny how this is exactly the documentation you'd need to make it easy for a human to work with the codebase. Perhaps this'll be the greatest thing about LLMs -- they force people to write developer guides for their code. Of course, people are going to ask an LLM to write the CLAUDE.md and then it'll just be more slop...
It's not exactly the doc you'd need for a human. There could be overlap, but each side may also have unique requirements that aren't necessarily suitable for the other. E.g. a doc for a human may have considerably more information than you'd want to give to the agent, or, you may want to define agent behavior for workflows that don't apply to a human.
Also, while it may be hip to call any LLM output slop, that really isn't the case. Look at what a poor history we have of developer documentation. LLMs may not be great at everything, but they're actually quite capable when it comes to technical documentation. Even a 1-shot attempt by LLM is often way better than many devs who either can't write very well, or just can't be bothered to.
Looking for a similar GEMINI.md
It might support AGENTS.md, you could check the site and see if it’s there
I've been a customer since Sonnet 3.5. It is getting to the point where Opus 4.5 usually does better than whatever your instructions in claude.md say, just by reading your code and having a general sense of what your preferences are.
I used to instruct about coding style (prefer functions, avoid classes, use structs for complex params and returns, avoid member functions unless needed by shared state, avoid superfluous comments, avoid silly utf8 glyphs, AoS vs SoA, dry, etc)
I removed all my instructions and it basically never violates those points.
It would be nice to see an actual example of what a good claude.md that implements all of these recommendations looks like.
It's always funny, I think the opposite. I use a massive CLAUDE.md file, but it's targeted towards very specific details of what to do, and what not to do.
I have a full system of agents, hooks, skills, and commands, and it all works for me quite well.
I believe in massive context, but targeted context. It has to be valuable, and important.
My agents are large. My skills are large. Etc etc.
The only good Claude.md is a deleted Claude.md.
This is the only correct answer.
Is CLAUDE.md required when claude has a --continue option?
I would recommend using it, yeah. You have limited context and it will be compacted/summarized occasionally. The compaction/summary will lose some information, and it is easy for it to forget certain instructions you gave it. Afaik claude.md will be loaded into the context on every compaction, which allows you to use it for instructions that should always be included in the context.
"Here's how to use the slop machine better" is such a ridiculous pretense for a blog or article. You simply write a sentence and it approximates it. That is hardly worth any literature being written as it is so self obvious.
This is an excellent point - LLMs are autoregressive next-token predictors, and output token quality is a function of input token quality
Consider that if the only code you get out of the autoregressive token prediction machine is slop, that this indicates more about the quality of your code than the quality of the autoregressive token prediction machine
What is a good Claude.md?
Claude.md - A markdown file you add to your code repository to explain how things work to Claude.
A good Claude.md - I don’t know, presumably the article explains.