  • From a researcher/developer perspective: the biggest bottleneck that affects current-gen AI is the lack of high quality training data

    I don’t completely agree with this. Recent papers have been working miracles with synthetic data generation and smaller datasets (e.g., Phi).

    Meanwhile, there’s a lot of speculation that Llama 4 failed because Meta’s ‘real’ data was vast but not ‘smart,’ hinted at by lines like this one:

    In order to maximize performance, we had to prune 95% of the SFT data, as opposed to 50% for smaller models, to achieve the necessary focus on quality and efficiency.

    Whereas DeepSeek, with a very similar architecture and size, wrote in their GRPO paper about how well synthetic data worked.

    And this keeps happening. As an example, Kimi Linear is (subjectively) performing very well in spite of its ‘small’ training dataset: https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct


    IMO the limiting factors seem to be GPU time, dev time, and willingness to ‘experiment’ with exotic architectures, optimizations, and more specialized models (including the time/cash burnt on experiments that don’t work).
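
    To make that pruning quote concrete: the idea is to score every SFT sample and keep only a small top fraction. Here’s a minimal sketch in Python; the quality_score heuristic and the keep-rate are stand-ins I made up, not whatever Meta actually used:

    ```python
    # Minimal sketch of quality-based SFT data pruning: score every sample,
    # keep only the top fraction. The scoring heuristic is a toy stand-in;
    # real pipelines use reward models, classifiers, and dedup filters.

    def quality_score(sample: dict) -> float:
        """Toy heuristic: prefer longer responses with less word repetition."""
        words = sample["response"].split()
        uniqueness = len(set(words)) / max(len(words), 1)
        return len(words) * uniqueness

    def prune_sft_data(samples: list[dict], keep_fraction: float) -> list[dict]:
        """Keep the top keep_fraction of samples by score (the quoted
        Llama 4 line implies a keep_fraction of about 0.05)."""
        ranked = sorted(samples, key=quality_score, reverse=True)
        return ranked[: max(1, int(len(ranked) * keep_fraction))]

    data = [
        {"response": "Paris is the capital of France."},
        {"response": "idk lol"},
        {"response": "The derivative of x^2 is 2x, by the power rule."},
    ]
    print(prune_sft_data(data, keep_fraction=0.34))  # keeps the single best sample
    ```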


  • …No.

    The way “AI” is set up now, it’s big blocks of “weights” that can run in two modes:

    • Inference: Running a model to generate some kind of prediction from an input, e.g. the next word in a block of text for an LLM. Typically this is not so hard and ‘batched,’ e.g. a single GPU may serve 16 people at once, in parallel (though I am skipping many intricacies here). See the first sketch after this list.

    • Training: Taking a bunch of data (like big blocks of specifically formatted, processed text for LLMs) and altering the weights to fit it with glorified linear regression. This is typically done on TONS of tightly networked GPUs in more specialized setups, usually 8x big ones in one server at a minimum. All the data selection/formatting is done by humans, by hand, though sometimes enhanced with algorithms that, say, generate a thinking trace. There’s also a distinction between ‘pretraining’ (making the initial model) and ‘finetuning’ (slightly altering it with new data, which is tricky). See the second sketch after this list.

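    To make the inference mode concrete, here’s a minimal PyTorch sketch. The toy model, shapes, and batch size are made up for illustration (a stand-in for a real LLM); the point is that one forward pass serves a whole batch, and the weights are never touched:

    ```python
    # Minimal sketch of batched inference with frozen weights (PyTorch).
    import torch
    import torch.nn as nn

    VOCAB, CTX = 1000, 8

    # Toy "LLM": embed 8 context tokens, flatten, predict the next token.
    model = nn.Sequential(
        nn.Embedding(VOCAB, 64),
        nn.Flatten(start_dim=1),
        nn.Linear(64 * CTX, VOCAB),
    )
    model.eval()  # inference mode

    # 16 "users" served at once: each row is one context of 8 token ids.
    batch = torch.randint(0, VOCAB, (16, CTX))

    with torch.no_grad():                    # no gradients, weights never change
        logits = model(batch)                # one forward pass for all 16 requests
        next_tokens = logits.argmax(dim=-1)  # greedy next-token pick per request

    print(next_tokens.shape)  # torch.Size([16]): one predicted token per user
    ```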

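    And a matching sketch of one training step, using the same toy model: unlike inference, the backward pass and optimizer update actually alter the weights.

    ```python
    # Minimal sketch of one training step (PyTorch): the backward pass
    # plus the optimizer update are what change the weights.
    import torch
    import torch.nn as nn

    VOCAB, CTX = 1000, 8
    model = nn.Sequential(
        nn.Embedding(VOCAB, 64),
        nn.Flatten(start_dim=1),
        nn.Linear(64 * CTX, VOCAB),
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    # Curated, formatted data: contexts plus the "correct" next token.
    contexts = torch.randint(0, VOCAB, (32, CTX))
    targets = torch.randint(0, VOCAB, (32,))

    logits = model(contexts)
    loss = loss_fn(logits, targets)
    loss.backward()       # compute gradients w.r.t. every weight
    optimizer.step()      # the weights actually change here, and only here
    optimizer.zero_grad()
    # At scale, this loop runs for weeks across thousands of networked GPUs.
    ```
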
    Point I’m making is: you cannot do both at once.

    You can ‘infer’ AI, but internally, it will never change. It never learns.

    You can train it with new data, but this is a huge, very manual, finicky, and infrequent endeavor. And it takes a long time.

    Theoretically, ‘learning on the go’ is a goal of machine learning, but right now Big Tech is acting about as innovative as a brick, just scaling up architectures and stoking egos rather than paying attention to this kind of research. The Chinese LLM companies are being relatively conservative too, but in a different way.

    Also, the recent fad (especially in China) is to make and use synthetic data instead, e.g. data some AI made up all by itself. This (in combination with smaller amounts of really clean/high-quality ‘real’ data, i.e. not some random files stolen from your computer) is actually quite effective.
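
    Here’s a sketch of that idea: an existing model writes candidate samples, a crude filter throws out junk, and the survivors get mixed with a small pool of clean ‘real’ data. gpt2 is just a small stand-in generator here, and the filter is a toy; real pipelines use much stronger generators, reward models, and verifiers:

    ```python
    # Minimal sketch of synthetic data generation (Hugging Face transformers).
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")  # stand-in generator

    seed_prompts = [
        "Explain why the sky is blue:",
        "Write a Python function that reverses a string:",
    ]
    synthetic = []
    for prompt in seed_prompts:
        outputs = generator(prompt, max_new_tokens=60,
                            num_return_sequences=4, do_sample=True)
        for candidate in outputs:
            text = candidate["generated_text"]
            # Crude quality gate; real filters use reward models and dedup.
            if len(text) > len(prompt) + 40 and "http" not in text:
                synthetic.append({"text": text, "source": "synthetic"})

    # Mix with a small amount of clean, curated 'real' data.
    clean_real = [{"text": "The sky is blue because of Rayleigh scattering.",
                   "source": "real"}]
    training_mix = clean_real + synthetic
    print(len(training_mix), "samples ready for finetuning")
    ```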


    If you’re worried about privacy, the advertising ‘models’ that companies like Facebook and Google already make, and have been making for over a decade, are closer to what you describe. It’s already happened, and we’ve been living with it for years.

    Some of that is oldschool machine learning, but not all of it.
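
    By ‘oldschool machine learning’ I mean things like logistic-regression click predictors trained on behavioral features. A minimal sklearn sketch; the features and numbers are invented for illustration, not what any real ad network uses:

    ```python
    # Minimal sketch of an "oldschool" ad model: logistic regression
    # predicting clicks from (invented) behavioral features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Columns: [age, pages_visited_today, past_ad_clicks], all hypothetical.
    X = np.array([[25, 40, 3], [52, 5, 0], [33, 22, 1], [19, 60, 5], [47, 8, 0]])
    y = np.array([1, 0, 0, 1, 0])  # did the user click the ad?

    model = LogisticRegression().fit(X, y)
    new_user = np.array([[28, 35, 2]])
    print("click probability:", model.predict_proba(new_user)[0, 1])
    ```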



  • Yeah, that sounds dreamy. It could certainly work.

    And yeah, the problem is not just Microsoft but Mojang. Mojang is an extremely conservative/careful dev, and was even before they got bought by MS. It’s why the game hasn’t enshittified too badly, but also why development seems to move so slowly for arguably the biggest game on Earth.

    Collaborating via a repo like that would be… a lot.

    Again, it’d be awesome and I think it would work, but it would be a massive step even if Microsoft wasn’t in the picture.



  • This is horrifying. Like, mind-bogglingly bad.

    It’s also a blatant lie. Other countries are not testing nuclear weapons; they’re running simulations. We didn’t do that in the 60s because it wasn’t possible, but the US (and presumably China and Russia) literally have giant supercomputers for this purpose now.

    …And we would know if they were doing real testing because it would show up on seismographs.


    But who cares about truth? Or sanity? No, all my relatives (even scientifically minded ones) won’t even bat an eye, lest some Democrat steal their retirement, ugh.



  • This is quintessential “Modern CCP”

    Take a real problem screwing up the western world bad (like influencer mis/disinformation), and smash it in a way only their massive state apparatus can…

    Superficially.

    It’s “proof” their party line works and, as always, a good way to control the populace, if abused. It’s probably effective, but not as effective as it appears on the surface.


    I’m sympathetic here.

    In past years I was a “free internet” libertarian-leaning diehard, but something has to be done about algos boosting shameless outrage peddlers; it’s literally destroying the planet and our collective psyche, just for short-term corporate benefit (or corpo-state benefit in China’s case, as its “Big Tech” is under the party’s thumb). But China just took the problem and used it as an excuse for more control.