Wow, that's really cool, I'll definitely check it out!
I've played around with machine learning algorithms built from scratch in C / CUDA too, but once I hit the CUDA part of it I kinda just left it to the side.
I'm curious, how did you use CUDA to optimize the matrix multiplications?
How optimized is training, does it take much longer than using PyTorch?
> "no one had the original detailed design documents"
Is this normal for a company like Boeing? It's not my area, but I would have guessed they would keep the documentation for everything on a product as important as a 747.
“Relevant to anyone building failure-attribution systems for agent pipelines — black-box distillation techniques here could feed into causal attribution models without needing white-box access to the underlying model.”
1000% this, this was us internally testing if our harness worked, the motivation was never to test them in-depth 1v1. We were just really shocked at the results, there’s a lot more work to do here.
Hello, author here, or one of them anyway. I can confirm that it was hand written. The 32% was all the Claude models (4.6, 4.7, 4.8) combined into one score; the 37% was Opus 4.6 specifically (which did the best).
I really don't understand this constant changing of numbers. I have tried a bunch of ATS reviewers and every time, on the same resume, I get different numbers. It's weird and unreliable. I understand the need for doing this to filter through thousands of CVs, but maybe there is a better way. Like a take-home test at the beginning, or a test of some kind.
Same. I haven't been able to see how people let agent loops run without significant steering and produce good quality software. VS Code with one or two integrated terminals running is fine for me. Or a couple of VS Code instances if I'm working on a couple of projects concurrently. The advantage of VS Code is the code / diff visibility if you like to be hands on.
Ha — you actually did it. Genuinely made my day; you've now read it more recently than I have. And I think you've landed where I did: the concept-mapping part is the keeper and the rest I have not put to practical use.
Hello! Author here (Katie). Ty for your comments. 4.6 and 4.7 both scored 28% on our benchmark; I just wanted to have 10 things in the list because I wanted a round number.
I'm intrigued by this idea, and plan to test it to build a new product that is a sibling of an existing one, but with a different targeted purpose. I believe it's verifiable enough to try a goal-oriented approach, but I'm slightly nervous about this all just being a way to get us to burn the next order of magnitude in tokens!
Curious how you're handling supervision after the switch. The remoteproc unit you converted by hand is the part that gets interesting: when it crashes, what restarts it, and what guarantees it came up only after its dependencies were actually ready rather than just launched? That dependency ordering and the cgroup resource accounting are the real work, and OpenRC hands both back to you.
Quest Global reflects how engineering-led firms are gaining attention in India’s IPO landscape, where long-term fundamentals matter more than listing numbers.
Yeah, and the richest person on the planet, Elon Musk, who has full vertical integration (with a partner) through Tesla, aka Solar Roof and Powerwall, wants to sell DCs in space and hasn't fixed any of the gas turbine use for Colossus 1 & 2.
Without enforcement, it will keep happening at a snail's pace.
It's a touchy subject because scientists are afraid of looking like cranks. Just like cognition in other nonhuman species was historically dismissed because no one wants to be seen as someone who thinks plants talk to each other.
The thing that most disconcerts me isn't the runtime pruning, it's the cold loading. Months ago, I added a few skills and MCPs to test them, partly in the frenzy of free shopping, but then I forgot about half of them.
So after I got tired of choosing by hand, and therefore also a bit blindly, I created a small tool that runs locally and analyzes conversations to tell you which skills, MCPs, or other things are always unused.
347 items never used · ~19354 dead tokens/session · ~$25.49/month
A lot of ECC that I never used but always loaded.
If anyone's interested, I've put it on GitHub, thousandflowers/skillreaper.
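For anyone wondering what that kind of analysis might look like, here is a minimal Python sketch of the general idea. It is not taken from skillreaper: the item names, token costs, log lines, and the `find_unused` / `dead_tokens` helpers are all made up for illustration, and matching items by plain substring is an assumption.

```python
def find_unused(installed, conversations):
    """Return the installed items that never appear in any conversation text."""
    used = set()
    for text in conversations:
        for name in installed:
            # Assumption: an item counts as "used" if its name shows up in the log.
            if name in text:
                used.add(name)
    return sorted(name for name in installed if name not in used)


def dead_tokens(unused, token_cost):
    """Sum the per-session token cost of items that are loaded but never used."""
    return sum(token_cost[name] for name in unused)


# Hypothetical data, for illustration only.
installed = ["pdf-export", "jira-mcp", "sql-helper"]
token_cost = {"pdf-export": 120, "jira-mcp": 300, "sql-helper": 80}
conversations = [
    "tool_call: sql-helper run query",
    "tool_call: sql-helper explain plan",
]

unused = find_unused(installed, conversations)
print(unused, dead_tokens(unused, token_cost))
```

With this toy data, only `sql-helper` is ever referenced, so the other two items are reported as unused along with their combined token cost. A real tool would need to parse the actual conversation format rather than match substrings.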
A necessary condition for the emergence of civilization (in its broadest sense, as in the collectively organized reshaping of one's living environment for one's purposes) is a basic level of trust and cooperation.
Humans are not naturally prone to bouts of violence the way other species are, and human societies do not tolerate impulsive violence in adults. Instead, the vast majority of human-on-human violence is deliberate: people plan, carefully and rationally, the killing of their fellow human beings in order to achieve their goals. This is known as 'war'. This is very rare among species, and more violent species cannot form complex societies at scale, despite also being quite smart in some cases (chimps, octopuses).
We even unconsciously internalized this idea to some degree, since most people are comfortable with the idea of militaries existing and being necessary but also agree that solitary murderous psychopaths should be put in prison.