@monotremata

monotremata@lemmy.ca · 1 month ago

Yeah, this is what I was going to call out. Calling it “100% solvable by humans” and saying “if human scores were included, they would be at 100%” when 20-60% of humans solved each task seems kinda misleading. The AI scores are so low that I don’t think this kind of hyperbole is necessary; I assume there are some humans that scored 100%, but I would find it a lot more useful if they said something like “the worst-performing human in our sample was able to solve 45% of the tasks” or whatever. Given that the AIs are still scoring below 1%, that’s still pretty dark.

monotremata@lemmy.ca · 1 month ago

username checks out

monotremata@lemmy.ca · 1 month ago

Yeah, agreed. I must have misunderstood your original comment.

monotremata@lemmy.ca · 1 month ago

I’m not sure I totally understand your comment, so bear with me if I’m agreeing with you and just not understanding that.

“let me prioritize PRs raised by humans” … but why? Why do that in the first place? If bots/LLMs/agents/GenAI genuinely worked they would not care if it was made or not by humans, it would just be quality submission to share.

Before LLMs, there was a kind of symmetry about pull requests. You could tell at a glance how much effort someone had put into creating the PR. High effort didn’t guarantee that the PR was high quality, but you could be sure you wouldn’t have to review a huge number of worthless PRs simply because the work required to make something that even looked plausibly decent was too much for it to be worth doing unless you were serious about the project.

Now, however, that’s changed. Anyone can create something that looks, at first glance, like it might be an actual bug fix, feature implementation, etc. just by having the LLM spit something out. It’s like the old adage about arguing online–the effort required to refute bullshit is an order of magnitude higher than the effort required to generate it. So now you don’t need to be serious about advancing a project to create a plausible-looking PR. And that means that you can get PRs coming from people who are just trolls, people who have no interest in the project but just want to improve their ranking on GitHub so they look better to potential employers, people who build competing closed-source projects and want to waste the time of the developers of open-source alternatives, people who want to sneak subtle backdoors into various projects (this was always a risk but used to require an unusual degree of resources, and now anyone can spam attempts to a bunch of projects), etc. And there’s no obvious way to tell all these things apart; you just have to do a code review, and that’s extremely labor-intensive.

So yeah, even if the LLMs were good enough to produce terrific code when well-guided, you wouldn’t be able to discern exactly what they’d been instructed to make the code do, and it could still be a big problem.

monotremata@lemmy.ca · 2 months ago

But then you’ve got a space that’s 5’ 7 3/8" and you need a clearance of 7/32" on each end, so your piece should be…uh… 5’ 6 15/16" long. So much easier than metric, right?

In metric it would be 1711mm (or 1.711m) and you’d need to take 5.5mm off each end, so it’s 1700mm. (For the record, I picked random numbers in imperial and only did the metric conversion afterwards, I just lucked into the nice round number here.)
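
For anyone who wants to check the arithmetic above, here is a minimal sketch using Python's standard `fractions` module. The helper names (`to_inches`, `format_feet_inches`) are made up for illustration, and the metric side rounds to the nearest millimetre (and the clearance to 5.5mm) the same way the comment does.

```python
from fractions import Fraction

# Exact by definition: 1 inch = 25.4 mm
MM_PER_INCH = Fraction(254, 10)


def to_inches(feet, inches, frac=Fraction(0)):
    """Combine feet, whole inches, and a fractional inch into total inches."""
    return feet * 12 + inches + frac


def format_feet_inches(total):
    """Render a non-negative length in inches as feet, inches, and a fraction."""
    feet, rem = divmod(total, 12)
    whole = int(rem)
    frac = rem - whole
    if frac == 0:
        return f"{feet}' {whole}\""
    return f"{feet}' {whole} {frac.numerator}/{frac.denominator}\""


# Imperial: 5' 7 3/8" space, 7/32" clearance on each end
space = to_inches(5, 7, Fraction(3, 8))
clearance = Fraction(7, 32)
piece = space - 2 * clearance
print(format_feet_inches(piece))  # 5' 6 15/16"

# Metric: convert, round the space to whole millimetres,
# and use the 5.5 mm clearance from the comment
space_mm = round(space * MM_PER_INCH)  # 1711.325 mm rounds to 1711
piece_mm = space_mm - 2 * 5.5
print(piece_mm)  # 1700.0
```

The imperial version needs three different bases just to add and subtract; the metric version is one subtraction.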

I dunno. You need however many sig figs you need in either system, but switching between a factor of 12 for the feet, base 10 for the inches, and what amounts to binary fractions for the partial inches sure does take getting used to. I’ve finally gotten used to it enough that I can do it in my head, but I prefer to work in metric for most things.

I acknowledge that machinists just use thousandths of an inch, which does greatly improve working with that system, but it also introduces a third kind of measurement that can’t easily be interconverted with the other two. I dunno. It just feels like we’re doing way too much work propping up this archaic system when literally everyone else in the world is using something simpler and we could just be on the same system.

monotremata@lemmy.ca · 6 months ago

If you haven’t already, check out Ludwig.

monotremata@lemmy.ca · 6 months ago

I mean, arguably this was done years ago with Return to Zork, Zork: Nemesis, and Zork: Grand Inquisitor. They shared a bit of the humor of the originals, but they were still pretty different.

monotremata@lemmy.ca · 6 months ago

https://www.youtube.com/watch?v=4nigRT2KmCE

monotremata@lemmy.ca · 6 months ago

Good questions. I don’t know, and I can no longer try to find out, as the mods have now removed the comment. (Sorry for the double-post–I got briefly confused about which comment you were referring to and deleted my first post, then realized I’d been frazzled and the post in question really was deleted by the mods.)

monotremata@lemmy.ca · 6 months ago

deleted by creator

monotremata@lemmy.ca · 6 months ago

Basically this: https://www.psychdb.com/cognitive-testing/clock-drawing-test

monotremata@lemmy.ca · 6 months ago

And then there’s the federal minimum wage for tipped workers, which is a paltry $2.13/hr. If the tips don’t bring that up to at least match the normal minimum (the $7.25/hr figure) then employers are supposed to pay them more to make up the difference, but they basically never do. Tip theft is also super common, including with some of the online ordering apps.

monotremata@lemmy.ca · 7 months ago

*Agnew

monotremata@lemmy.ca · 7 months ago

Musk is currently trying to negotiate a pay package from Tesla which, if approved, would pay him roughly a trillion dollars over the next decade. It’s in the “not really comprehensible by humans” range.

https://arstechnica.com/tech-policy/2025/10/musks-1-trillion-pay-plan-doesnt-force-him-to-keep-focus-on-tesla-critics-say/

monotremata@lemmy.ca · 7 months ago

Doing it on a weekday means it can be more disruptive to the status quo, but that also means more divisive. Doing it on the weekend means it’s easier for more people to participate and it can come across as more of a statement of unity.

I don’t think one is strictly better than the other. I do think this one will have an impact.

monotremata@lemmy.ca · 7 months ago

My cat’s name is Clark, which over time sometimes became Clark-a-doodle, which became Doodle, which became Doodlebug, which became Bug and occasionally Buggle.

When I’m frustrated with him, it’s now sometimes Frettled Gruntbuggly.

monotremata@lemmy.ca · 7 months ago

And his last name is “maligno.” Even Disney wouldn’t use a villain name that blatant. Is reality just a really bad fanfic at this point?

monotremata@lemmy.ca · 7 months ago

As an American, I can confirm, it’s fucking grim here right now.

monotremata@lemmy.ca · 7 months ago

Ritalin isn’t methamphetamine, but Desoxyn is, and that’s also used for ADHD.

monotremata@lemmy.ca · 7 months ago

🎶 The dream of the ’90s is alive in Linux 🎶