So my theory is that with the help of telemetry or something else, AI can learn from data stored on users’ computers, meaning AI can steal your completed work, as well as your edits and corrections to your work, etc., even offline if you’re a Windows user for example.
In short, AI will be able to learn from you even when you edit your articles, edit your drawings, improve your music, etc. In other words, AI will literally steal your soul.
What do you think about it?
I mean, yes. That’s what we’re all kind of conjecturing on at the moment.
I want to warn against conspiratorial thinking because I’m always trying to avoid falling into it myself. We probably won’t ever have direct proof of a lot of this unless there are credibly authenticated insider leaks from Microsoft.
But we can analyze evidence. Things like seeing that the Terms of Service have been updated for so many apps and online platforms to automatically opt your work in for AI training purposes. The fact that Microsoft is forcing online accounts. The fact that Microsoft is defaulting save locations to OneDrive (https://www.zdnet.com/article/microsoft-word-forcing-you-to-save-new-files-to-the-cloud-heres-how-to-stop-it/) and that they are training AI on pictures you’ve saved there and you can only opt out 3 times a year??? (https://www.windowscentral.com/microsoft/onedrives-ai-face-scanning-feature-suggests-it-can-only-be-disabled-3-times-a-year-but-that-doesnt-seem-right)
Telemetry is already pretty opaque; it’s hard to tell what data they are wrapping up in it. And if I understand what you’re saying correctly, yes: it’s very easy to keep that data bottled up until the next time the computer comes online, so that even your “offline” activities are monitored. Look at Recall, or even before that, Activity History, which was already giving me the heebie-jeebies (https://support.microsoft.com/en-us/account-billing/what-is-the-recent-activity-page-23cf5556-4dbe-70da-82c8-bb3a8d8f8016)
It’s been theorized that they already consumed the entire internet, probably some time ago. In order to keep chasing those diminishing returns they are incentivized to chase every new data source because that scales into profit (no it doesn’t, that’s the bubble, but that’s a whole other conversation).
It’s pretty bleak. Use Linux, etc.
You can strip AI out of this post and nothing changes. Granting various things access to your various systems/works has done, and will keep doing, things like this.
People can learn from it with lots of effort, if they get access to the data. But it’s not so much effort (time) for an AI company (for good-enough quality), and since Microsoft collects it, it affects not only what you willingly publish but virtually anything on your computer.
AI will literally steal your soul.
You don’t have a soul to steal. And even if you did, if it was comprised of the documents you have on your computer, it would be the saddest soul to ever have existed. I pity you if this is a thing you seriously worry about, since it must surely mean your life is meaningless, drab, and without warmth of any kind.
That’s no theory, that’s reality
…No.
The way “AI” is set up now is as big blocks of “weights” that can be run in two modes:
- Inference: Running a model to generate some kind of prediction from input, e.g. the next word in a block of text for an LLM. Typically, this is not so hard and is ‘batched,’ e.g. a single GPU may serve 16 people at once, in parallel (though I am skipping many intricacies here).
- Training: Taking a bunch of data (like big blocks of specifically formatted, processed text for LLMs), and altering the weights to fit it with glorified linear regression. This is typically done on TONS of tightly networked GPUs in more specialized setups. Usually 8x big ones in one server, at a minimum. And all the data selection/formatting is done by humans, by hand, though sometimes enhanced with algorithms that, say, generate a thinking trace. There’s also a distinction between ‘pretraining’ (making the initial model) and ‘finetuning’ (slightly altering it with new data, which is tricky).
Point I’m making is: you cannot do both at once.
You can ‘infer’ with AI, but internally, it will never change. It never learns.
You can train it with new data, but this is a huge and very manual/finicky and infrequent endeavor. And it takes a long time.
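To make the inference/training split concrete, here is a toy sketch in plain Python. It is not how any real LLM is built: the two-number "model", the made-up data points, and the function names `predict` and `train_step` are all assumptions for illustration. The only point is that running the model reads the weights, while changing them takes a separate, explicit training loop.

```python
# Toy model: y = w * x + b. The "weights" are just two floats here.
def predict(weights, x):
    """Inference: read the weights, never write them."""
    w, b = weights
    return w * x + b


def train_step(weights, batch, lr=0.05):
    """Training: one gradient-descent step on mean squared error.

    Returns NEW weights; this is the only place they change.
    """
    w, b = weights
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return (w - lr * grad_w, b - lr * grad_b)


weights = (0.0, 0.0)
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # made-up points on y = 2x + 1

# Inference, repeated as often as you like, leaves the weights untouched.
before = weights
outputs = [predict(weights, x) for x, _ in data]
assert weights == before

# Training is a separate, explicit loop that rewrites the weights.
for _ in range(2000):
    weights = train_step(weights, data)

print(weights)  # should end up close to (2.0, 1.0)
```

Nothing in `predict` feeds back into `weights`, which is the sense in which a deployed model "never learns" from the prompts it serves.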
Theoretically ‘learning on the go’ is a goal of machine learning, but right now Big Tech is acting about as innovative as a brick, and just scaling up architectures and stoking egos rather than paying attention to this kind of research. The Chinese LLM companies are being relatively conservative too, but in a different way.
Also, the recent fad (especially in China) is to make and use synthetic data instead, e.g. data some AI made up all by itself. This (in combination with smaller amounts of really clean/high quality ‘real’ data, e.g. not some random files stolen from your computer) is actually quite effective.
If you’re worried about privacy, the advertising ‘models’ companies like Facebook and Google already make, and have been making for over a decade, are closer to what you describe. It’s already happened, and we’ve been living with it for years.
Some of that is oldschool machine learning, but not all of it.
There’s always pen and paper.
All except this line is happening.
AI will literally steal your soul
And that can’t happen because it can’t steal what doesn’t exist.
So my theory is that with the help of telemetry or something else, AI can learn from data stored on users’ computers, meaning AI can steal your completed work, as well as your edits and corrections to your work, etc., even offline if you’re a Windows user for example.
No. It can’t do that.
You mean like Copilot right now?
Please explain how a PC not connected to anything online is able to steal your documents.
Bonus if you use “telemetry” to explain something about offline PCs.
I’m not even following why this is an argument, but from personal experience: about 15 years ago the company I was at added telemetry to its Windows product. This was legit, as we were looking to stamp out some pesky bugs we had never been able to reproduce; no personal data was collected. But the basic implementation was to write collected data to disk, then there was a completely separate service whose only job was uploading that data.
The point is the basic model supported collecting telemetry data even while offline and it would get uploaded if you ever were online
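For anyone who hasn't seen this pattern, a minimal sketch of the write-to-disk-then-upload design described above might look like the following. This is not that company's code or Microsoft's; the spool file name, `record_event`, `flush_spool`, and the `send` callback are all hypothetical, and a plain list stands in for the remote server.

```python
import json
import tempfile
from pathlib import Path


def record_event(spool_dir: Path, event: dict) -> None:
    """Collector side: append one event to a local spool file. No network needed."""
    with open(spool_dir / "events.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def flush_spool(spool_dir: Path, online: bool, send) -> int:
    """Uploader side: if online, pass every spooled event to `send`, then clear the spool.

    Returns the number of events uploaded.
    """
    spool = spool_dir / "events.jsonl"
    if not online or not spool.exists():
        return 0
    lines = spool.read_text(encoding="utf-8").splitlines()
    events = [json.loads(line) for line in lines]
    for event in events:
        send(event)
    spool.unlink()
    return len(events)


with tempfile.TemporaryDirectory() as tmp:
    spool_dir = Path(tmp)
    uploaded = []  # stand-in for a remote server

    # Events are collected while the machine is offline...
    record_event(spool_dir, {"kind": "crash", "code": 17})
    record_event(spool_dir, {"kind": "crash", "code": 42})
    offline_count = flush_spool(spool_dir, online=False, send=uploaded.append)

    # ...and leave the machine only once a connection exists.
    online_count = flush_spool(spool_dir, online=True, send=uploaded.append)

print(offline_count, online_count, uploaded)
```

Keeping the collector and the uploader separate is what lets collection continue with no connection at all; nothing leaves the disk until the uploader finds the machine online.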
then there was a completely separate service whose only job was uploading that data.
And it uploaded that data…
Thru the Internet?
The point is the basic model supported collecting telemetry data even while offline and it would get uploaded if you ever were online
That hypothetical would involve a massive upload, especially since OP is talking about every edit and not just documents.
Like, type a 300-word essay: it would need to send a huge number of document versions, because it’s a new edit every time a letter is typed/erased.
Like, it’s a ridiculous scenario.
We don’t need those; there are plenty of real-world problems with AI.
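As a rough back-of-the-envelope check on the upload size for that 300-word essay, here is a small sketch. Every number in it is an assumption: about 6 characters per word including the space, one byte per character, no deletions, and a guessed 16 bytes per logged keystroke event. It compares saving a full copy of the text after every keystroke with logging only the keystrokes themselves.

```python
def snapshot_bytes(n_chars: int) -> int:
    """Total bytes if a full copy of the text is saved after every keystroke."""
    return n_chars * (n_chars + 1) // 2  # 1 + 2 + ... + n


def delta_bytes(n_chars: int, bytes_per_event: int = 16) -> int:
    """Total bytes if only each keystroke event is logged (assumed 16 bytes each)."""
    return n_chars * bytes_per_event


n = 300 * 6  # assumption: ~6 characters per word, including the space
full = snapshot_bytes(n)
delta = delta_bytes(n)
print(f"{n} keystrokes: snapshots ~{full / 1e6:.2f} MB, deltas ~{delta / 1e3:.1f} KB")
```

Under these assumptions the full-copy approach grows with the square of the text length (about 1.6 MB here), while a keystroke log grows linearly (about 29 KB). Whether either counts as "massive" depends on how many documents and users are involved.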
Only if you allow Copilot on your device.
Updating (or buying a new computer) force-feeds Copilot to your system, like Edge, does it not?
I mean, that is pretty much what AI bros want to do… and/or maybe are already doing.
From a researcher/developer perspective: the biggest bottleneck that affects current-gen AI is the lack of high quality training data; the more high-quality (a.k.a. human-generated and not complete shitposts) training data, the better. What people write on their computers would probably overwhelmingly be high quality. That means, without major technological advancements… if AI companies have access to the types of content you just described, it is very much in their interest to use it.
I don’t 100% agree with this view, but if you subscribe to Prof. Emily M. Bender’s thought of seeing AI models as plagiarism machines, maybe you can say that AI is “stealing your soul”
From a researcher/developer perspective: the biggest bottleneck that affects current-gen AI is the lack of high quality training data
I don’t completely agree with this. Recent papers have been working miracles with synthetic data generation and smaller datasets (e.g., Phi).
Meanwhile, there’s a lot of speculation that Llama 4 failed because Meta’s ‘real’ data was vast but not ‘smart,’ with hints via lines like this:
In order to maximize performance, we had to prune 95% of the SFT data, as opposed to 50% for smaller models, to achieve the necessary focus on quality and efficiency.
Whereas DeepSeek, with a very similar architecture and size, wrote about how well synthetic data worked in their GRPO paper.
And this keeps happening. As an example, Kimi Linear is (subjectively) performing very well in spite of its ‘small’ training dataset: https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct
IMO the limiting factor seems to be GPU time, dev time, and willingness to ‘experiment’ with exotic architectures, optimizations, and more specialized models (including the burnt time/cash on experiments that don’t work).
That is, fundamentally, what some of us figure the long term plan is with Microsoft Recall.
It came with various guarantees of privacy, the first time they tried it.
But they know no one reads changes to terms of service.
The sad part is that I fully expect that to be the default reality in a few years: a Microsoft model training on every keystroke and click on every copy of Windows 11/12.
Did you try asking AI?
No.
I just tried to draw logical conclusions
What you’re really describing sounds like a deeper fear: that AI might absorb your creativity - your decisions, your refinements, your “style” - without permission. That’s a valid and serious cultural concern.
If models are trained on massive amounts of human creative work (often scraped from the web), then yes - society faces a collective version of this “soul stealing,” where human creativity feeds a machine that imitates it. The ethical debate is still ongoing, and new laws and technical standards are emerging to address it (e.g., data provenance, opt-out tags, content authenticity).



