Data accidentally exposed by Microsoft AI researchers




What's that, the second major data loss / leak event from MSFT recently?

Is your data really safe there?


It's not reasonable to expect human security token generation to be perfectly secure all the time. The system needs to be safe overall. The organization should have set an OrgPolicy on this entire project to prevent blanket sharing of auth tokens/credentials like this. Ideally blanket access tokens should be opt-in, not opt-out.

Google banned generation of service account keys for internally-used projects. So an awry JSON file doesn't allow access to Google data/code. This is enforced at the highest level by OrgPolicy. There's a bunch more restrictions, too.


Straight to jail.


Nah, Microsoft probably has a blameless culture


It was hackers, for sure.


This is very similar to how some security researchers got access to TikTok's S3 bucket:

They used the same mechanism of using common crawl or other publicly available web crawler data to source dns records for s3 buckets.
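
A rough sketch of that idea, assuming the crawl data is already available as plain text: scan it for virtual-hosted-style S3 hostnames and collect the bucket names. The regex, the function name and the sample HTML are illustrative assumptions, not what those researchers actually ran.

```python
import re

# Matches virtual-hosted-style S3 hostnames such as
#   my-bucket.s3.amazonaws.com  or  my-bucket.s3.eu-west-1.amazonaws.com
S3_HOST = re.compile(
    r"\b([a-z0-9][a-z0-9.-]{1,61}[a-z0-9])\.s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com\b"
)

def find_bucket_names(text: str) -> set:
    """Return the set of bucket names referenced in a blob of crawled text."""
    # Hostnames are case-insensitive, so normalise before matching.
    return {m.group(1) for m in S3_HOST.finditer(text.lower())}

# Hypothetical crawled HTML, made up for this example.
sample = '''
<img src="https://example-assets.s3.amazonaws.com/logo.png">
<a href="https://Example-Backups.s3.eu-west-1.amazonaws.com/dump.tar">x</a>
<a href="https://example.com/nothing-here">y</a>
'''
print(sorted(find_bucket_names(sample)))
```

Knowing a bucket name is only the discovery step; whether anything is readable still depends on how the bucket is configured.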


Looks like it was up for 2 years with that old link[1]. Fixed two months ago.





If only Microsoft hadn’t named the project “robust” models transfer, they could have dodged this Hubrisbleed attack.




Kind of incredible that someone managed to export Teams messages out from Teams…


how is this sort of stuff not at least encrypted at rest?


What do you think "encryption at rest" means


Encryption at rest does nothing to prevent online access to data. It's only useful if you leave your storage cabinet standing on the side of the road.


Your laptop backup could be encrypted. New problem: where to put the keys. Maybe another storage account with different access controls.


> New problem: where to put the keys.

If it's Windows, Active Directory.


Per the article, the Azure bucket was explicitly shared. Azure Storage is generally encrypted at rest.


I really dislike how Azure makes you juggle keys in order to make any two Azure things talk together.

Even more so, you only have two keys for the entire storage account. Would have made much more sense if you could have unlimited, named keys for each container.


> if you could have unlimited, named keys for each container.

These exist and are called shared access signature (SAS) tokens. People are too lazy to use them and just use the account-wide keys instead.


> I really dislike how Azure makes you juggle keys in order to make any two Azure things talk together.

Actually there is a better way. Look into “Managed Identity”. This allows you to grant access from one service to another, for example grant access to allow a specific VM to work with your storage account.


This is what we are using for everything. It makes life so much easier.

So far, our new Azure tenant has absolutely zero passwords or shared secrets to keep track of.

Granting a function app access to SQL Server by way of the app's name felt like some kind of BS magic trick to me at first. But it absolutely works. Experiences like this give me hope for the future.


Would be insane if the GPT4 model is in there somewhere (as it's served by Azure).


Also imagine all such exposed data sources including those that are not yet discovered... are crawled and trained on by GPT5.

Meanwhile a big enterprise provider like MS suffers a bigger leak and exposes MS Teams/ OneDrive / SharePoint data of all its North America customers say.

Boom we have GPT model that can autonomously run whole businesses.


Well there is that "transformers" folder at the bottom of the screenshot...


It's always funny that Wiz's big security revelations are almost always about Microsoft, when Wiz's founder was the highest-ranking person in charge of cybersecurity at Microsoft in his previous job.


Would be kind of surprising if that weren't the case.


On a lighter note - I saw a chat message that started with "Hey dude! How is it going". I'm disappointed that the response was not


This is quite funny for me because at first I didn't understand what the problem is.

In German, if you ask this question, it is expected that your question is genuine and you can expect an answer (although usually people don't use this opportunity to unload their emotional baggage, it can happen!).

Whereas in English you assume this is just a hello and nothing more.


In England people say "You all right" and move on without even waiting for a response!


In America it's even worse because they say "What's up?" in the same way we Brits say "Alright?", but "What's up?" sounds to me like the person has detected something wrong with you and wants to know what the problem is. At least "Alright?" is more generally asking for your status.

Of course, both are generally rhetorical, which must be confusing for some foreigners learning English, especially with the correct response to "Alright?" being "Alright?" and similarly with "What's up?".


I believe the correct response is "Chicken butt," but maybe I'm in very exclusive company in responding that way.


I love that an entire website was made around this, without any attempt to sell me anything. So rare to see that these days


Glad I've never had to deal with that in chat.

Though I have had the equivalent in tech support: "App doesn't work" which is basically just hello, obviously you're having an issue otherwise you wouldn't have contacted our support.


Unfortunately, the AI researcher did not use an LLM to automatically respond with the nohello content.


Destroying comradery with a co-worker - Any % (WR)


I strongly support the “no hello” concept but I also fear being seen as “that guy” so I never mention it. Sigh


Be that guy. In the long run it's better to be right than popular.


But then I might not survive the long run.


I have seen people never ask their question after multiple days of saying "hello @user", despite having nohello as a status. And despite having asked them in the past to just ask their question and I'll respond when I can.

You just can't win.


I'd count that as a win. You avoided wasting your time answering a potentially inane question. If it were important, they would have asked.


I've made peace with people sending me a bare "hello" with no context. I ignore it until there's something obvious to respond to. Responding with the "no hello" webpage will often be received as (passive) aggressive, and that's a bad way to start off a conversation.

Usually within a few minutes there's followup context sent. Either the other party was already in the process of writing the followup, or they realized there was nothing actionable to respond to and they elaborate.


I make it my status message.


The people who need it aren’t the type of people who’d read it.


I tried that on Slack for a while; it made no difference. I don't think most people read the status message. The medium lends itself to the "Hi" type messages, unfortunately. There's not really a way to constrain human nature, other than to not use instant messaging at all. (I also tried changing my status to a note telling people to phone me; that didn't work either.)

I made it my status message as well and all I got was a complaint passed along from my manager because somebody said that it was too rude and that I should be more gentle with my fellow corporate comrades...


I should have a slack bot that replies automatically to generic greetings… that way they’ll get on with whatever the issue is and I won’t have to reply.


Ha ha, that's a great idea!

A: Hello!

B's bot: Hello to you too! I am a chatty bot which loves responding to greetings. Is there a message I can forward to B?
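
For what it's worth, the detection half of such a bot is tiny. A minimal sketch, assuming a message counts as a "bare greeting" when it is a greeting word plus at most one more word (a name or @mention); the word list, function names and reply text are made up here, and wiring it into Slack is left out entirely.

```python
import re
from typing import Optional

# Hypothetical list of greeting words; extend to taste.
GREETING_WORDS = {"hi", "hello", "hey", "yo", "hiya", "howdy"}

def is_bare_greeting(message: str) -> bool:
    """True when a message is only a greeting, optionally with a name or @mention."""
    words = re.findall(r"[@\w']+", message.lower())
    if not words or words[0] not in GREETING_WORDS:
        return False
    # Allow at most one extra word, e.g. "hello @user" or "hey dude".
    return len(words) <= 2

def auto_reply(message: str) -> Optional[str]:
    """Return a canned reply for bare greetings, or None to stay silent."""
    if is_bare_greeting(message):
        return "Hello to you too! Is there a message I can forward to B?"
    return None

print(auto_reply("Hello!"))
print(auto_reply("Hey, can you review my PR before lunch?"))
```

The two-word cutoff is deliberately crude: "Hey dude! How is it going" slips through, which is arguably a bug.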


"No hello" implies that people shouldn't be friendly at all, and comes across as rude.

The concept simply needs a more descriptive name to be accepted. It's not about not saying hello. It's about including the actual request in the first message, usually after the hello.


zsh, any way to download the stuff?


The article is focusing on AI and Teams messages for some reason, but the exposed bucket had passwords, SSH keys, credentials, .env files and most probably a lot of proprietary code. I can't even imagine the nightmare it has created internally.


Fortunately not a whole lot of data, and for sure with a little bit like that there wasn't anything important, confidential or embarrassing in there. Looking forward to Microsoft's itemised list of what was taken, as well as their GDPR-related filing.


This seems to be a common occurrence with Big Tech and Big Government, so we better get used to it:


Is this stuff regularly happening to AWS and GCP? This is like the 3rd insane security incident from Microsoft in the past year.


Ok so it’s not Microsoft exposing Microsoft, but government exposing its S3 buckets.

The question should be — why is all that data and power concentrated in one place? Because of the capitalist system and Big Tech, or Big Government.

Personally I am rather happy when “top secret information” is exposed, because that is the type of thing that harms people around the world more than it helps. The government wants to know who is sending you $600 but doesn't want to tell you how they spent trillions on shadowy “defense” contractors.


Amazing how ingrained it is in some people to just go around security controls.

Someone chose to make that SAS have a long expiry and someone chose to make it read-write.
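
Both choices are visible in the URL itself, since a SAS carries its permissions in the `sp` query parameter and its expiry in `se`. A minimal sketch of a check for exactly those two things; the function name, the seven-day threshold, the set of "write-like" permission letters and the example URL are all assumptions for illustration, and it only inspects a URL you already hold (it cannot enumerate tokens that were issued).

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional
from urllib.parse import parse_qs, urlsplit

# Permission letters treated as write-like here: add, create, write, delete.
WRITE_LIKE = set("acwd")

def audit_sas_url(url: str,
                  max_lifetime: timedelta = timedelta(days=7),
                  now: Optional[datetime] = None) -> List[str]:
    """Return human-readable findings for an overly broad SAS URL."""
    now = now or datetime.now(timezone.utc)
    params = parse_qs(urlsplit(url).query)
    findings = []

    # 'sp' holds the signed permissions, e.g. "r" or "racwdl".
    perms = params.get("sp", [""])[0]
    risky = sorted(set(perms) & WRITE_LIKE)
    if risky:
        findings.append(f"grants write-like permissions: {''.join(risky)}")

    # 'se' holds the signed expiry as an ISO 8601 UTC timestamp.
    expiry_raw = params.get("se", [""])[0]
    if not expiry_raw:
        findings.append("no expiry (se) in the URL")
    else:
        expiry = datetime.fromisoformat(expiry_raw.replace("Z", "+00:00"))
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)
        if expiry - now > max_lifetime:
            findings.append(
                f"expires {expiry.date()}, more than {max_lifetime.days} days away"
            )
    return findings

# Hypothetical URL; the account name and signature are placeholders.
url = ("https://example.blob.core.windows.net/models"
       "?sv=2020-08-04&sp=racwdl&se=2051-10-06T07:09:59Z&sig=REDACTED")
for finding in audit_sas_url(url):
    print(finding)
```

Something like this could sit in a pre-commit hook or secret scanner, so a read-write, decades-long link gets flagged before it lands in a public README.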


It’s easy.

“ugh, this thing needs to get out by end of week and I can’t scope this key properly, nothing’s working with it.”

“just give it admin privileges and we’ll fix it later”

Sometimes they’ll put a short TTL on it, aware of the risk. Then something major breaks a few months later, it gets a 15-year expiry, and it is never remediated.

It’s common because it’s tempting and easy to tell yourself you’ll fix it later, refactor, etc. But then people leave, stuff gets dropped, and security is very rarely a priority in most orgs - let alone remediation of old security issues.


Two of the things that make me cringe are mentioned: pickle files and SAS tokens. I get nervous dealing with Azure storage. Use RBAC. They should deprecate SAS and account keys IMO.

SOC2-type auditing should have been done here, so I am surprised at the reach: a SAS with no expiry, and then the deep level of access it gave, including machine backups with their own tokens. A lot of lack of defence in depth going on there.

My view is burn all secrets. Burn all environment variables. I think most systems can work based on roles. Important humans get access via username, password and other factors.

If you are working in one cloud you don’t in theory need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adaptors to convert them into RBAC too. But I am not a security expert, just paranoid lol.


Pickle files are cringe, but they're also basically unavoidable when working with Python machine learning infrastructure. None of the major ML packages provide a proper model serialization/deserialization mechanism.

In the case of scikit-learn, the code implementing some components does so much crazy dynamic shit that it might not even be feasible to provide a well-engineered serde mechanism without a major rewrite. Or at least, that's roughly what the project's maintainers say whenever they close tickets requesting such a thing.


ONNX[0], model-as-protobufs, continuing to gain adoption will hopefully solve this issue.



ONNX is cool, but it still only supports a minority of scikit-learn components. Some of them simply aren't compatible with ONNX's basic design.


at work we use the ONNX serialisation format for all of our prod models. Those get loaded by the ONNX runtime for inference. works great.

perhaps it'd be viable to add support for the ONNX format even for use cases like model checkpointing during training, etc.?


You should check out safetensors. They are used widely in diffusion models and LLMs


So SAS tokens are worse than some admin setting up "FileDownloaderAccount" and then sharing its password with multiple users or using the same one for different applications?

I'd take SAS tokens with expiration over people setting up a shared RBAC account and sharing the password for it.

Yes, people should do proper RBAC, but point me at a company and I will find dozens of "shared" accounts. People don't care and don't mind. When beating them up with sticks does not solve the issue, SAS tokens, while still not perfect, help quite a lot.


FileDownloaderAccount had no copy pastable secret that can be leaked. Shared passwords are unnecessary of course and not good. If people are going to do that just use OneDrive/Dropbox rather than letting people use advanced things.


Many SOC2 audits are a joke. We were audited this year and were asked to provide screenshots of various categories (but most being of our own choosing in the end). Only requirement was screenshots needed to show date of the computer on which the screenshot had been taken, as if it couldn't be forged as well as the file/exif data.


If you forge your SOC2 evidence you will legitimately wish you were never born once caught


We aren't doing that. I just mention the laziness of the auditors and that asking for screenshots is just dumb. At this point you can just ask a simple question: do you comply or not?




Absolutely, RBAC should be the default. I would also advocate separate storage accounts for public-facing data, so that any misconfiguration doesn't affect your sensitive data. Just typical "security in layers" thinking that apparently this department in MSFT didn't have.


I wouldn't trust MSFT with my glass of chocolate milk at this point. I would come back to lipstick all over the rim and somehow multiple leaks in the glass


@4mm character width:

4e-6 * 3.8e+13 = 152 million kilometers of text.

Nearly 200 round trips to the moon.
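
Spelling the arithmetic out, in case anyone wants to check it. The sketch assumes one character per byte for the 3.8e13 figure and an average Earth-Moon distance of about 384,400 km; neither number comes from the comment above.

```python
CHAR_WIDTH_KM = 4e-6               # 4 mm per character, expressed in kilometers
CHARS = 3.8e13                     # assumed: 3.8e13 bytes at one character per byte
MOON_ROUND_TRIP_KM = 2 * 384_400   # assumed average Earth-Moon distance, there and back

text_km = CHAR_WIDTH_KM * CHARS
round_trips = text_km / MOON_ROUND_TRIP_KM

print(f"{text_km / 1e6:.0f} million km of text")
print(f"about {round_trips:.1f} round trips to the Moon")
```

So "nearly 200" holds up: the result is a bit under 198 round trips with these assumptions.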


> This case is an example of the new risks organizations face when starting to leverage the power of AI more broadly, as more of their engineers now work with massive amounts of training data.

It seems like a stretch to associate this risk with AI specifically. The era of "big data" started several years before the current AI boom.


Agreed. It should say "new risks organizations face when starting to leverage the power of Azure" or "the power of cloud computing". But that's not as clickbait-worthy a title.


AI has magnified the use cases, though. Before, Big Data was an advertising machine meant to tokenize and market to every living being on the planet. Now, machine learning can create "averaged" behavior of just about anything, given enough data and specificity.


This comment is a good bit of rationalization. Whatever categorical mismatch you feel is happening, it misses the overarching point: the focus should be on the broader systemic issues. Data security is not a first- or second-tier priority for "big data" or "AI"... largely because there's no cost to doing it poorly.


The second clause covers that: this isn’t an AI problem, just as it wasn’t a big data problem when the same kinds of things happened a decade ago. It’s a problem caused when you set up something new outside of what the organization is used to and have people without appropriate training asked to make security decisions: I’d bet that this work was being done by people who were used to the academic style, blending personal and corporate use on the same device, etc. and simply weren’t thinking of this class of problem. The description sounds a lot like the grad students & postdocs I used to support – you’d see some dude with Steam on his workstation because it was faster than his laptop and since he was in the lab 70 hours a week anyway, why not 90?

The challenge for organizations is figuring out how to support research projects and other experiments without opening themselves up to this kind of problem or stymieing R&D.


With big data comes big responsibility


This is the risk of using, checks notes, Azure and working with Microsoft.

Except there is no risk for them. They've proven time and again that they can have major security snafus and not be held accountable.


Virtual networks are a nightmare to set up and manage in Azure, which is why everyone just takes the easy path and doesn't bother.

Almost every Azure service we deal with has virtual networks as an afterthought because they want to get to market as quickly as possible, and even for them managing vnets is a nightmare.

Not to excuse developers/users though. There are plenty of unsecured S3 buckets, Docker containers, and GitHub repos that expose too much "because it's easier". I've had a developer check their FTP creds into a repo the whole company has access to. He even broke the keys up and concatenated them in shell to work around the static checks "because it's easier" for their dev/test flow.


They have all the regulatory paperwork in place, so it must be fine.


They are also the top line investment for the majority of mutual and pension funds. Don't crab too much, they are funding your retirement.



My opinion is that it was not an "accident", but they prepare us for the era where powerful companies will "own" our data in the name of security.

Should have been sent to prison.


Don't get pickled, friends!


Oof. Is that containing code from GitHub private repos?


Just proves how hard cloud security is now. One or two mistakes and you expose TBs. Insane.


Hard-coded secrets in shareable URLs with almost infinite time windows, and no way to audit what’s made and shared and at what level?

Sounds like it’s as hard as it’s always been. Pretty basic and filled with humans


I feel like it's harder.

It's no longer hierarchical, with organization schemes limited to folders and files. People no longer talk about network paths, or server names.

Mobile and desktop apps alike go to enormous effort to abstract and hide the location at which a document gets stored, instead everything is tagged and shared across buckets and accounts and domains...

I expect that the people at this organization working on cutting-edge AI are pretty sharp, but it's no surprise that they don't entirely understand the implications of "SAS tokens" and "storage containers" and "permissive access scope" on Azure, and the differences between Account SAS, Service SAS, and User Delegation SAS. Maybe the people at are sharper, but unless I missed the sarcasm, they may be wrong when they say [1] "Generating an Account SAS is a simple process." That looks like a really complicated process!

We just traced back an issue where a bunch of information was missing from a previous employee's projects when we changed his account to a shared mailbox. Turns out that he'd inadvertently been saving and sharing documents from his individual OneDrive on O365 (There's not one drive! There are many! Stop trying to pretend there's only one drive!) instead of the "official" organization-level project folder, and had weird settings on his laptop that pointed every "Save" operation at that personal folder, requiring a byzantine procedure to input a real path to get back to the project folder.



> but unless I missed the sarcasm, they may be wrong when they say [1] "Generating an Account SAS is a simple process." That looks like a really complicated process!

No, unless I misunderstand, it is actually intended to be understood the other way:

It is too easy to create a too-broad token.

And in the next paragraph, after the image, they explain that in addition to it being easy to create, these tokens are impossible to audit.


My wife and I just rewatched WarGames for the millionth time a few nights ago.

The level of cybersecurity incompetency in the early 80's makes sense; computers (and in particular networked computers) were still relatively new, and there weren't that many external users to begin with, so while the potential impact of a mistake was huge (which of course was the plot of the movie), the likelihood of a horrible thing happening was fairly low just because computers were an expensive, somewhat niche thing.

Fast forward to 2023, and now everyone owns bunches of computers, all of which are connected to a network, and all of which are oodles more powerful than anything in the 80s. Cybersecurity protocols are of course much more mature now, but there's also several orders of magnitude more potential attackers than there were in the 80s.


> Cybersecurity protocols are of course much more mature now

At the technical level, sure. At the deployment, configuration and management level, not quite. Overall things are so bad that the news isn't even reporting the hospitals taken over by ransomware anymore. It's still happening almost every week and we're just... used to it.


> wardialing

Get a load of these guys, honey, you could just dial straight into the airline.


That modem setup in Wargames is still a thing for many organizations including some banks and telcos. Not naming names but I suspect the modems will be around for a very long time. Some have a password on their modem but they are usually very simple. Their only saving grace is that they are usually in front of a mainframe speaking proprietary MML that only old fuddy duddies like me would remember. There are a few of us here


> proprietary MML that only old fuddy duddies like me would remember.

Security through obscurity helps only until someone gets curious/determined. I have a personal anecdote for that. During university I was involved in pentesting an industrial control system (not in an industrial context, but same technology) and implemented a simple mitm attack to change the state of the controls while displaying the operator selected state. When talking with the responsible parties, they just assumed that the required niche knowledge means the attack is not feasible. I had the first dummy implementation setup on the train ride home based only on network captures. Took another day to fine tune once I got my hands on a proper setup and worked fine after that.

I do not want to say that ModbusTCP is in the same league as MML, but if there is interest in it, someone will figure it out. Sure, you might not be on Shodan, but are the standard/scripted attacks really what you should worry about? Also don't underestimate a curious kid who nerdsnipes themself into figuring that stuff out.


> Security through obscurity helps only until someone gets curious/determined.

Absolutely. It just weeds out the skiddies and tools like Metasploit, unless they have added mainframe support. I have not kept up with their libraries.

The federal agencies I was liaison to knew all the commands better than I did and even taught me a few that were not in my documentation which led to a discussion with the mainframe developers.


what does this have to do with a "modem" per se?


The parent comment was about the movie Wargames and the questionable security of the 80's that is still in use today. That security in Wargames was a modem that provided access to a subsystem of the WOPR mainframe named "Joshua". Joshua had super-user privs on the mainframe.

It was likely meant to be a temporary means for the system architect to monitor and improve the system after it was deployed but then life changing circumstances may have distracted his attention away from decommissioning the modem. The movie still holds up today and is worth a watch. Actually it may be more pertinent now than ever.


i still love the phreaking scene trying to make a phone call where he uses the can pull tab to ground the phone. it was more of a phreaker vibe than trying to whistle into the phone or social engineer an operator or just happening to have a dialer on him.


Yeah, when we were rewatching it, we were kind of amazed at how well it holds up, all things considered.

I think what makes it likable for me is that it's all on the cusp of believability. Obviously LLMs weren't quite mature enough to do everything Joshua did back then (and probably not now), but the fact that the "hacking" was basically just social engineering, and was just achieved by wardialing and a bit of creative thinking makes it somewhat charming, even today.

With the advent of LLMs being used increasingly for everyone, I do wonder how close we're going to get to some kind of "Global Thermonuclear War" simulation gone awry.


> I suspect the modems will be around for a very long time.

No they won't.

'Dial up' modems need a PSTN line to work. The roll out of full fibre networks means analogue PSTN is going the way of the dodo. You cannot get a new PSTN line anymore in Blighty. In Estonia and the Netherlands (IIRC) the PSTN switch off is already complete.


I should have restricted that statement to the United States of America. PSTN lines are still utilized, deployed and actively sold in most of the US. As a side note, I recently tried to get a telco to remove a phone line and two poles and they refused to do it. Their excuse was that they might one day run fiber over it despite there already being a fiber network here. I hope they do, as my fiber ISP really does need a competitor. If they really do run the fiber over those poles vs burying it, that would be amusing.

To your point, I am sure some day the US will stop selling access to the PSTN, but some old systems will hold on for dear life, government contracts and all. Governments are kind of slow to migrate to newer things.


> As a side note I recently tried to get a telco to remove a phone line and two poles and they refused to do it.

You need to align their incentives with yours: wait until it gets windy out, knock the poles down, and demand that they come fix it.


I've been secretly hoping an over-sized big rig would take them out but I would not want anyone to get hurt. They are the only poles within a few miles and are an eye-sore.


>'Dial up' modems need a PSTN line to work

Cable company here (US) still sells service that has POTS over cable modem. Just plug your modem into the cable modem tele slot and you have a dial tone. Now, are you getting super high speed connections? No, but that's not what you need for most hacking like this. Not that I recommend hacking from your own house.


Surely there’s a vendor that will sell you a v.22bis modem that works over VoIP if that’s what your two mainframes need to sync up, and you’re buying the multimillion dollar support contract…


Microsoft, too big to fa.. care.


A number of replies here are noting (correctly) how this doesn't have much to do with AI (despite some sentences in this article kind of implicating it; the title doesn't really, fwiw) and is more of an issue with cloud providers, confusing ways in which security tokens apply to data being shared publicly, and dealing with big data downloads (which isn't terribly new)...

...but one notable way in which it does implicate an AI-specific risk is how prevalent it is to use serialized Python objects to store these large opaque AI models, given how the Python serialization format was never exactly intended for untrusted data distribution and so is kind of effectively code... but stored in a way where both what that code says as well as that it is there at all is extremely obfuscated to people who download it.

> This is particularly interesting considering the repository’s original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library. It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
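
To make "arbitrary code execution by design" concrete, here is a harmless sketch using only the standard library. The `Payload` class and the `6 * 7` expression are hypothetical stand-ins for a tampered model file. The second half follows the "restricting globals" pattern from the Python `pickle` documentation, refusing every global lookup, which is far stricter than a real model loader could afford.

```python
import io
import pickle

class Payload:
    """Stand-in for a tampered model file (hypothetical, harmless)."""
    def __reduce__(self):
        # Instead of object state, the pickle records "call eval('6 * 7')".
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)   # eval() runs here, during loading
print("loading returned:", result)

class NoGlobalsUnpickler(pickle.Unpickler):
    """Refuses to resolve any global, so only plain data types can load."""
    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"blocked global {module}.{name}")

def safe_loads(data: bytes):
    """Load a pickle that contains only built-in containers and scalars."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()

print(safe_loads(pickle.dumps({"weights": [0.1, 0.2]})))
try:
    safe_loads(blob)
except pickle.UnpicklingError as exc:
    print("rejected:", exc)
```

Swap the expression for anything else and the same load call runs it, with nothing visible to the person who just wanted model weights.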

Occasionally, I’ll talk to someone suggesting a dynamically typed language (or stringly-typed java) for a very large scale (in developer count) security or mission critical application.

This incident is a good one to point back to.


that has literally nothing to do with the topic, which is just misconfigured cloud stuff. people really like starting these old crappy language arguments anywhere they can


types have nothing to do with this, strictly speaking; the same problems would exist if you serialised structures containing functions in a typed language to e.g. a dll or a .class file and asked users to load it at runtime

the problem is in fact the far more subtle principle of "don't download and run random code, and definitely don't make it the idiomatic way to do things," and i'm not sure you can blame your use of eval()-like things on the fact that they exist in your language in the first place


The difference is that no one shares data in a statically typed language by sending over dlls or .class files. The entire point is that something so dangerous has been normalized because of dynamic typing.


poor engineering choices are just that, choices


Some tools make poor choices harder or impossible. That's the entire point of static typing too. In this case python encouraged insecure design choices by making them very easy and even presenting them to users.


Yeah, because statically typed languages never had any kind of deserialization vulnerabilities.




What is the best practice? I'm assuming something that isn't a programming language object...


laughs in log4j vuln

A good fraction of the flaws we found at Matasano involved pentests against statically typed languages. If an adversary has root access to your storage box, they can likely find ways to pivot their access. Netpens were designed to do that, and those were the most fun; they’d parachute us into a random network, give us non-root creds, and say “try to find as many other servers that you can get to.” It was hard, but we’d find ways, and it almost never involved modifying existing files. It wasn’t necessary — the bash history always had so many useful points of interest.

It’s true that the dynamics are a little different there, since that’s a running server rather than a storage box. But those two employees’ hard drive backups have an almost 100% chance of containing at least one pivot vector.

Sadly choice of technology turns out to be irrelevant, and can even lead to overconfidence. The solution is to pay for regular security testing, and not just the automated kind. Get someone in there to try to sleuth out attack vectors by hand. It’s expensive, but it pays off.


Am I one of few people who is frightened by shell history files? I always disable mine because it just seems like a roadmap to interesting stuff for anyone who might gain access to it. Including even stuff like sudo passwords typed at the wrong time or into the wrong window.


Sure. But, you could auto-encrypt your ~/.bash_history if you're concerned about it being a problem and might need it for backtracing any issues etc?


The terminal backlog is just sitting in memory as well. Just don’t leave passwords there, remove them immediately. You also have an option not to save the command in history, e.g. a whitespace prefix in bash (when HISTCONTROL includes ignorespace or ignoreboth). Half of my bash commands that are longer than 20 symbols start with ^R to look up a similar command and edit it; not having history would make that much slower.


The typing of Python isn’t the issue; it’s effectively the eval problem of not having a separation between code and data in the pickle format often used out of convenience. There are lots of pure data containers, like Hugging Face’s safetensors or TensorFlow’s protobuf checkpoints, that could have been used instead.
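
A toy illustration of what "pure data container" means, loosely modeled on the layout safetensors uses (a length prefix, a JSON header describing each tensor, then raw bytes). This is a simplified sketch with made-up function names, limited to flat float32 vectors, and it is not a compatible implementation of any real format.

```python
import json
import struct

def save_tensors(tensors: dict) -> bytes:
    """Serialize named float32 vectors as: u64 header length, JSON header, raw bytes."""
    header, payload = {}, bytearray()
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)   # little-endian float32
        header[name] = {
            "dtype": "F32",
            "shape": [len(values)],
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(payload)

def load_tensors(blob: bytes) -> dict:
    """Parse the container; the loader only ever reads numbers and strings."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    data = blob[8 + header_len:]
    out = {}
    for name, meta in header.items():
        start, end = meta["data_offsets"]
        count = (end - start) // 4                        # 4 bytes per float32
        out[name] = list(struct.unpack(f"<{count}f", data[start:end]))
    return out

blob = save_tensors({"layer0.bias": [0.5, -1.25, 3.0], "layer0.scale": [2.0]})
print(load_tensors(blob))
```

A corrupted file can still make this loader raise, but there is no instruction in the format that means "call this function", which is the property pickle lacks.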


I’ll venture that it’s at least adjacent that the indiscriminate assembly of massive, serious pluralities of the commons on a purely unilateral basis for profit is sort of a “just try and stop us” posture that whether or not directly related here, and clearly with some precedent, is looking to create a lot of this sort of thing over and above the status-quo ick.


I have no idea what you are saying. If it is: "bad incentives cause people to misbehave", you generated an impressive verbiage around it :)


I have a bad habit of using 5 words when 1 will do: but I was saying that the probably fucking illegal status quo on AI corpus assembly is making an already ugly world a lot fucking worse.


Many people are also unaware that json is way, way, way faster than Python pickles, and human-editing-friendly. Not that you'd use it for neural net weights, but I see people use Python pickles all the time for things that json would have worked perfectly well.


Are you sure json is faster than pickle in recent python versions? That's not intuitive to me and search result blurbs seem to indicate the opposite.
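
This is easy to measure rather than guess. Below is a small benchmark sketch using only the standard library. The payload shape is an arbitrary assumption, and the answer can change with data shape, pickle protocol, and Python version, so no expected result is claimed here.

```python
import json
import pickle
import timeit

# Arbitrary test payload; swap in something shaped like your real data.
payload = {
    "ids": list(range(1000)),
    "names": [f"item-{i}" for i in range(1000)],
}


def bench(label, dumps, loads, repeat=200):
    """Time serialization and deserialization of the payload."""
    blob = dumps(payload)
    assert loads(blob) == payload  # sanity check: round trip is lossless
    t_dump = timeit.timeit(lambda: dumps(payload), number=repeat)
    t_load = timeit.timeit(lambda: loads(blob), number=repeat)
    return {"format": label, "dump_s": t_dump, "load_s": t_load, "bytes": len(blob)}


results = [
    bench("json", lambda o: json.dumps(o).encode(), lambda b: json.loads(b)),
    bench("pickle", pickle.dumps, pickle.loads),
]

for r in results:
    print(r)
```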


So, a little bit like a lot of people think that (non-checksummed/non-encrypted) PDFs cannot be modified, even though they are easily editable with Libre freaking Office ?


You can’t edit them in Word, so that must be too advanced for most people. LibreOffice never opened the PDFs too well for me, but Inkscape was pretty good, one page at a time though.


Doesn't Microsoft Office have the equivalent to Libre Office Draw ?? (That's the one that edits PDFs.)

I'm pretty sure I used that one in middle school ?? (Though not to edit PDFs, and it might have been the Microsoft Works equivalent.)


Disclosure I work for the company that released this: but we do have a tool to support scanning many models for this kind of problem.

That said you should be using something like safe-tensors.


You have me curious now. The models generate text. Could a model hypothetically be trained in such a way that it could create a buffer overflow when given certain prompts? I am guessing the way inference works means that can't happen.


Absolutely, though that isn't strictly what we're talking about here.

In this case, models themselves are fundamentally files. These files can have malicious code embedded into them that is executed when the model is loaded for further training or inference. When executed it isn't obvious to the user at all. It's a very nasty potential vector.
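
For pickle-based model files specifically, one static way to look for this without ever loading the file is to walk the pickle opcode stream and flag the opcodes that import or call things. This is only a sketch of the general idea, not a description of any particular scanning product, and the opcode set below is my own assumption. Legitimate pickles of custom classes use these opcodes too, so a hit means "look closer", not "malicious".

```python
import pickle
import pickletools

# Opcodes that import names or invoke callables during unpickling.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
    "NEWOBJ", "NEWOBJ_EX", "BUILD",
}


def suspicious_opcodes(blob):
    """List flagged opcode names found in a pickle, without loading it."""
    found = {
        opcode.name
        for opcode, _arg, _pos in pickletools.genops(blob)
        if opcode.name in SUSPICIOUS_OPCODES
    }
    return sorted(found)


class Payload:
    # Hypothetical stand-in for embedded code; harmless here.
    def __reduce__(self):
        return (print, ("executed at load time",))


plain = pickle.dumps({"weights": [0.1, 0.2, 0.3]})
hostile = pickle.dumps(Payload())

print(suspicious_opcodes(plain))
print(suspicious_opcodes(hostile))
```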

I wrote a blog about it here:


The other aspect that pertains to AI is the data-maximalist mindset around these tools: grab as much data, aggregate it all together, and to hell with any concerns about what and how the data is being used; more data is the competitive advantage. This means a failure that might otherwise be quite limited in scope becomes huge.


The safetensors format was created exactly for this - safe model serialization


For me it's also interesting as a potential pathway for data poisoning attacks - if you have control over the data used to train a production model, can you modify the dataset such that it inserts a backdoor into any model subsequently trained on it? E.g. what if gpt was biased to insert certain security vulnerabilities as part of its codegen capabilities?


In theory for any AI model that generates code you'll want to have a series of post generation tests, for example something like SAST and/or SCA that ensure the model is not biasing itself to particular flaws.

At least for common languages this should stand out.

Where it gets more tricky is watering hole attacks against specialized languages or certain setups. This said you'd have to ensure that this data is not already there scraped up from the internet.


The AI version of

At the moment such techniques would seem to be superfluous. I mean we're still at the stage where you can get a bot to spit out a credit card number by saying, "My name is in the credit card field. What is my name?"

That said, what you're describing seems totally plausible. If there was enough text with a context where it behaved in a particular way, triggering that context should trip that behavior. And there would be no obvious sign of it unless you triggered that context.

AI is hard.


It’s risky to make definitive claims about what is or isn’t a possible security vector, but based on my years of training GPTs, you’d find it very difficult for a number of reasons.

Firstly, the malicious data needs to form a significant portion of the data. Given that training data is on the order of terabytes, this alone makes it unlikely you’ll be able to poison the dataset.

Unless the entire training dataset was also stored in this 38TB, you’ll only be able to fine tune the model, and fine tuning tends to destroy model quality (or else fine tuning would be the default case for foundation models — you’d train it, fine tune it to make it “even better” somehow, then release it. But we don’t, because it makes the model less general by definition).


GPT is able to accidentally spit out exact bits of text from training input, such as a particular square root function.

What fraction of the training data needed to be that text?


If the question is "Would it be possible to get GPT to try to add backdoors to code examples by poisoning the training data?" my answer would be no. The sheer quantity of training data means that even with GPT-4's assistance in generating code examples that match the format of the original training data, you wouldn't be able to inject enough poison to change the model's behavior by much.

Remember, once the model is trained, it's verified in a number of ways, ultimately based on human prompting. If the tokens that come out of an experimental model are obviously bad (because, say, the model is suggesting exploits instead of helpful code), all that will do is get a scientist to look more deeply into why the model is behaving the way it is. And then that would lead to discovering the poisoned data.

The payoff for an attacker is whether they can achieve some sort of goal. You'd have to clearly define what that goal is in order to know how effective the poisoning attack could be. What's the end game?


As I commented elsewhere, GPT is such a target rich security environment that it is hard to know why you would bother with this. On the other hand, advanced persistent attackers (eg the NSA) have a pretty good imagination. I could see them having both motive and means to go out of their way to achieve a particular result.

On human checks, demonstrates that it would be possible to inject content that will pass that.


Makes me wonder if there would be a way to pollute imagenet so a particular image would always match for something like a facial recognition access control system or the like. Maybe adversarial data that would hide particular traffic patterns from an AI enabled IDS would be more plausible and something the NSA might be interested in.


I don't disagree with you on targeted attacks, but if you're creating output at scale then I'd say there's marginally more risk.

It's possible there's some minimum amount of poisoned data (a % or log function of a given dataset size n) that would then translate to generating a vulnerable output in x% of total outputs. If x is low enough to get past fine tuning/regression testing but high enough to still occur within the deployment space, then you've effectively created a new category of supply-chain attack.
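
A back-of-the-envelope version of that trade-off, as a sketch: assume each output is independently vulnerable with probability x, and that any vulnerable output seen during testing is caught. Both are simplifying assumptions, and every number below is made up.

```python
def p_undetected(x, n_test):
    """Chance that n_test sampled outputs contain no vulnerable one."""
    return (1.0 - x) ** n_test


def expected_bad(x, n_deploy):
    """Expected number of vulnerable outputs across n_deploy generations."""
    return x * n_deploy


# Hypothetical figures: 1 in 100,000 outputs is vulnerable, the release
# process samples 10,000 outputs, and deployment serves 10 million.
x = 1e-5
n_test = 10_000
n_deploy = 10_000_000

print(f"P(slips past testing) = {p_undetected(x, n_test):.3f}")
print(f"Expected vulnerable outputs in deployment = {expected_bad(x, n_deploy):.0f}")
```

Under those assumptions the poisoning would slip past testing roughly nine times out of ten while still producing about a hundred vulnerable outputs in the field, which is the window being described.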

There's probably more research that needs to be done into occurrence rate of poisoned data showing up in final output, and that result is likely specific to the AI model and/or version.


That's a lot of data.


The article tries to play up the AI angle, but this was a pretty standard misconfiguration of a storage token. This kind of thing happens shockingly often, and it’s why frequent pentests are important.


Pentests where people actually get out of bed to do stuff (read code, read API docs etc) and then try to really hack your system are rare. Pentests where people go through the motions, send you a report with a few unimportant bits highlighted while patting you on the back for your exemplary security so you can check the box on whatever audit you're going through are common.


If you're a large company that's actually serious about security, you'll have a Red Team that is intimately familiar with your tech stacks, procedures, business model, etc. This team will be far better at emulating motivated attackers (as well as providing bespoke mitigation advice, vetting and testing solutions, etc.).

Unfortunately, compliance/customer requirements often stipulate having penetration tests performed by third parties. So for business reasons, these same companies will also hire low-quality pen-tests from "check-box pen-test" firms.

So when you see that $10K "complete pen-test" being advertised as being used by [INSERT BIG SERIOUS NAME HERE], good chance this is why.


Ugh, in the work I do I run into so much of this kind of stuff.

Customer: "We had a pentest/security scan/whatever find this issue in your software"

Me: "And they realized that mitigations are in place as per the CVE that keep that issue from being an exploitable issue, right"

Customer: "Uhhhh"

Testing group: "Use smaller words please, we only click some buttons and this is the report that gets generated"


what I always want to know when people talk about this is "what reputable companies can I actually pay to do a real pentest (without costing hundreds of thousands of dollars)."


I think hiring a security specialist is the way to go.


The problem is security is a "market for lemons". Just like when trying to buy a used car, you need someone who is basically an expert in selling used cars.

In order to purchase a reputable pentest, you basically have to have a security team that is mature enough to have just done it themselves.

I can throw out some names for some reputable firms, but you are still going to need to do some leg work vetting the people they will staff your project with, and who knows if those firms will be any good next year or the year after.

Here's a couple generic tips from an old pentester:

* Do not try and schedule your pentest in Q4, everyone is too busy. Go for late Q1 or Q2. Also say you are willing to wait for the best fit testers to be available.

* Ask to review resumes of the testing team. They should have some experience with your tech and at least one of them needs to have at least 2 years experience pen-testing.

* Make sure your testing environment is set up, as production like as possible, and has data in it already. Test the external access. Test all the credentials, once after you generated them, again the night before the test starts. The most common reason to lose your good pentest team and get some juniors swapped in that have no idea what they are doing is you delayed the project by not being ready day 1.


Let me tell you about the laptop connected to our network with a cellular antenna we found in a locked filing cabinet after getting a much-delayed forced-door alert. This, after some social engineering attempts that displayed unnerving familiarity with employees and a lot of virtual doorknob-rattling.

They may be rare, but "real" pentests are still a thing.


Ouch. How did that end up?


Yep, most pentests go through the OWASP list and call it done.


Honestly, the OWASP top ten is generic enough that most vulnerabilities fit in it: "injection", "security misconfiguration", "insecure design".

The problem is

1. knowing the gazillion web vulnerabilities and technologies

2. being good enough to test them

3. kicking yourself into going through the laborious process of understanding and testing every key feature of the target.


The problem is that is what most companies want. They don't want to spend the money nor get the feedback beyond "Best case standards". It's a calculated risk.


It's great if it's done exhaustively


From my understanding as a non security expert:

Pentest comes across more as checking all the common attack vectors don’t exist.

Getting out of bed to do the so-called “real stuff” is typically called a bug bounty program or security researching.

Both exist and I don’t see why most companies couldn’t start a bug bounty program if they really cared a lot about the “real stuff”


Pentest means penetration testing, which means one needs to put on the attacker's hat and try to get into your network or the app infrastructure and grab as much data as possible, be it institutional or customer data. It can be through technical means as well as social engineering practices. And then report back.

This is in no way related to a bug bounty program.


Counter point: Most of the top rated Bug Bounty hunters have a background in penetration testing.

I think it's more accurate to say Bug Bounty only covers a small subset of penetration testing (mainly in that escalation and internal pivoting are against the BB policy of most companies).


Bug bounty programs are a nightmare to run. For every real bug reported you’ll get thousands of nikto pdfs with CRITICAL in big red scare letters all over them. Then you’ll get dragged on twitter constantly for not being serious about security. Narrowing the field to vetted experts will similarly get you roasted for either having something to hide or not caring about inclusion. And god help you if you have to explain that you already knew about a bug reported by anyone with more than 30 followers…

There are as many taxonomies of security services as there are companies selling them. You have to be very specific about what you want and then read the contract carefully.


I think the concern is more about the theatre of most modern pen-testing rather than expecting deep bug-bounty work. I'm not a security expert either, but I've had to refute "security expert" consultations from pen-test companies, and the reports are absolutely asinine half the time and filled with so many false positives due to very weak signature matching that they're more or less useless and give a false sense of security.

For example, dealing with a "legal threat" situation with the product I work on because a client got hit by ransomware and they blame our product because "we just got a security assessment saying everything was fine, and your product is the only other thing on the servers" -- checked the report, basically it just runs some extremely basic port checks/windows config checks that haven't been relevant for years and didn't even apply to the Windows versions they had, and in the end the actual attack came from someone in their company opening a malicious email and having a .txt file with passwords.

I don't doubt there are proper security firms out there, but I rarely encounter them.


That’s interesting. I thought maybe it’s a resource constraint issue, where companies prioritise investment in other areas and do the minimum to “get certified” but it sounds like finding a good provider can be extremely difficult.


I work as a pentester (as a freelancer nowadays).

Getting out of bed and "real stuff" is supposed to be part of a pentest.

The problem is more the sheer amount of stuff you are supposed to know to be a pentester. Most pentesters come into the field knowing a bit of XSS, a few things about PHP, and SQL injections.

Then you start to work, and the clients need you to test things like:

- compromise a full Windows Network, and take control of the Active Directory Server. Because of a misconfiguration of Active Directory Certificate Services. While dealing with Windows Defender

- test a web application that uses websockets, React, nodejs, and GraphQL

- test a WinDev application, with a Java backend on an AIX server

- check the security of an architecture with multiple services that use a Single Sign on, and Kubernetes

- exploit multiple memory corruption issues ranging from buffer overflow to heap and kernel exploitation

- evaluate the security of an IoT device, with a firmware OTA update and secure boot.

- be familiar with cloud tokens, and compliance with European data protection law.

- Mobile Security, with iOS and Android

- Network : radius, ARP cache poisoning, write a Scapy Layer for a custom protocol, etc

- Cryptography, you might need it

Most of this is actual stuff I had to work on at some point.

Even if you just do web, you should be able to detect and exploit all those vulnerabilities:

Nobody knows everything. Being a pentester is a journey.

So in the end, most pentesters fall short on a lot of this. Even with an OSCP certification, you don't know most of what you should know. I heard that in some companies, people don't even try and just give you the results of a Nessus scan. But even if you are competent, sooner or later, you will run into something that you don't understand. And you have max 2 weeks to get familiar with it and test it. You can't test something that you don't understand.

The scanner always gives you a few things that are wrong (looking at you TLS ciphers). Even if you suck, or if the system is really secure. You can put a few things into your report. As a junior pentester, my biggest fear was always to hand an empty report. What were people going to think of you, if you work 1 week and don't find anything?


Thanks for your honest reply. This part was my favourite:

    Nobody knows everything. Being a pentester is a journey.

I recommend that you add some contact details to your HN bio page. You might get some good leads after this post.

>As a junior pentester, my biggest fear was always to hand an empty report.

I'm trying to remember the rule where you leave something intentionally misconfigured/wrong for the compliance people to find and that you can fix so they don't look deeper into the system. A fun one with web servers is to get them to report they are some ancient version that runs on a different operating system. Like your IIS server showing it's Apache 2.2 or vice versa.

But at least from your description it sounds like you're attempting to pentest. So many of these pentesting firms are click a button, run a script, send a report and go on to the 5 other tickets you have that day type of firms.


> From my understanding as a non security expert:

That certainly helps.


People are going to chit-chat about things only tangentially related to their areas of expertise; it is good when we’re honest about our limitations.

If nothing else, an obviously wrong take is a nice setup for a correction.


What a shame, HackerNews typically has more insightful comments than garbage like this.

Edit: thanks to everyone who wrote some insightful responses, and there are indeed many. Faith in HackerNews restored !


Not really.

Real stuff should always be a pentest: a penetration test where one is actively trying to exploit vulnerabilities. So the person who orders it gets a report with !!exploitable vulnerabilities!!.

Checking all common attack vectors is vulnerability scanning, which is mostly running a scanner and weeding out false positives without trying to exploit any of them. Unfortunately most companies/people call that a penetration test, which it cannot be, because there is no attempt at penetration. Automated scanning tools might do some magic to confirm a vulnerability, but it still is not a penetration test.

Finally, a bug bounty program is different in that you never know whether any security researcher will even be interested in testing your system. So in reality you want to order a penetration test. The scope of a bug bounty program is also usually limited to what is publicly available: if company systems don't allow non-business users to create an account, a security researcher will never have an authenticated account to test with. A bounty program has other limitations too, because a pentesting company gets a contract and can get much more access, such as a white box test where they know the code and can work through it to prove there is an exploitable issue.


The checkbox form exists because crooked vendors are catering to organizations who are intentionally lazy about their cybersecurity.

Real penetration tests provide valuable insight that a bug bounty program won't.


As in every industry there are cheapskates, and especially in pentesting it is often hard for the customer to tell the good ones from the bad ones. Nevertheless, I think that you have never worked with a credible pentesting vendor. I am doing these tests for a living and would be ashamed to deliver anything coming near your description :-)


Cloud buckets have all sorts of toxic underdevelopment of features. They play make believe that they're file systems for adoption.

Like for starters, why is it so hard to determine effective access in their permissions models?

Why is the "type" of files so poorly modeled? Do I ever allow people to give effective public access to a file "type" that the bucket can't understand?

For example, what is the "type" of code? It doesn't have to be this big complex thing. The security scanners GitHub uses know that there's a difference between code with and without "high entropy strings" aka passwords and keys. Or if it looks like data:content/type;base64, then at least I know it's probably an image.
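
For anyone who hasn't seen the "high entropy string" heuristic, here is a minimal sketch using Shannon entropy per character. The length and entropy thresholds and the sample text are arbitrary assumptions; production scanners combine entropy with known key formats and are tuned far more carefully.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def high_entropy_tokens(text, min_len=20, threshold=4.0):
    """Whitespace-separated tokens that are long and look random."""
    return [
        tok for tok in text.split()
        if len(tok) >= min_len and shannon_entropy(tok) >= threshold
    ]


# Made-up command line: the key is random-looking, the path is not.
sample = (
    "deploy --region westus "
    "--key q7Zp0LxV3mTf9RkB2sYwE8uHc1NdJ6aG "
    "--path /usr/local/share/documentation"
)

print(high_entropy_tokens(sample))
```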

What if it's weird binary files like .safetensors? Someone here saying you might "accidentally" release the GPT4 weights. I guess just don't let someone put those on a public-resolvable bucket, ever, without an explicit, uninherited manifest / metadata permitting that specific file.

Microsoft owns the operating system! I bet in two weeks, the Azure and Windows teams can figure out how to make a unified policy manifest / metadata for NTFS & ReFS files that Azure's buckets can understand. Then again, they don't give deduplication to Windows 11 users; their problem isn't engineering, it's the financialization of essential security features. Well, joke's on you guys: if you make it a pain for everybody, you make it a pain for yourself, and you're the #1 user of Azure.


> it’s why frequent pentests are important.

Unfortunately a lot of pen testing services have devolved into "We know you need a report for SOC 2, but don't worry, we can do some light security testing and generate a report for you in a few days and you'll be able to check the box for compliance"

Which I guess is better than nothing.

If anyone works at a company that does pen tests for compliance purposes, I'd recommend advocating internally for doing a "quick, easy, and cheap" pen test to "check the box" for compliance, _alongside_ a more comprehensive pen test (maybe call it something other than a "pen test" to convince internal stakeholders who might be afraid that a 2nd in depth pen test might weaken their compliance posture since the report is typically shared with sales prospects)

Ideally grey box or white box testing (provide access to codebase / infrastructure to make finding bugs easier). Most pen tests done for compliance purposes are black-box and limit their findings as a result.


Narrowly scoped tests designed for specific compliance requirements are fine. They lower the barrier to entry to some degree for even getting testing and still, or often enough, return viable results. There's also SAAS companies that have emerged that effectively run a scripted analysis of cloud resources. The two together are more economical and still accomplish the goals that having compliance in the first place sets out.

When I was consulting architecture and code review were separate services with a very different rate from pentesting. Similar goals but far more expensive.


I recently ran into something along the lines of your devolved pentest concept. I have a public facing webapp, and the report came back with a list of "critical" issues that are solved by yum update. Nothing about vulnerability to session jacking or anything along the lines of requiring actual work. I was a few steps removed from the actual testing, so who knows what was lost in translation and it being the first time I've ever had something I worked on pen tested. However, I feel this was more of a script kiddie port scan level of effort vs actually trying to provide useful security advice. The whole process was very disappointing.


How behind on yum updates were you anyway?


not very. i guess i was too cavalier in hand waving it as a yum update. some of it was switching to a new repo with the most recent version available. but that was still just using yum. not like it required changes to the code base and workflow. maybe it was an amazon-linux-extras command for the actual package change, but still.


I've seen worse. Couple years back, there was an audit that included an internal system I've been working on. It was running on Debian oldstable because of a vital proprietary library I wasn't able to get working on stable at the time, but it had unattended upgrades set up and all that.

The company made some basic port scan and established that we're running outdated and vulnerable version of Apache. I found the act of explaining the concept of backports to a "pentester" to be physically painful.

They didn't get paid and another company was entrusted with the audit.


This is why I always attempt to turn off as much version information output as possible from any service. Make the pentester do their homework and not just look at "Apache 2.XX"

Hopefully you also have an internal control that looks at actual package versions installed on the server.


Normally I do that too, but this was fairly new and internal application that was still in development, so that's why it was there. And if it wasn't for this incident, they might actually trick our management into thinking they're somehow qualified to carry out such an audit.


This is actually a take away that I did implement. it's one of those that's not actively a vuln, but might provide info on what other attacks to try.


How would a pentest find that? Ok in this case it's splattered onto github; but the main point here is that you might have some unknown number of SAS tokens issued to unknown storage that you probably haven't any easy way to revoke.


A number of ways, including:

- finding the token directly in the repo

- reviewing all tokens issued
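
As a sketch of the first item: once a SAS URL turns up in a repo, its query string already says how dangerous it is. The `sp` (permissions) and `se` (expiry) parameters are part of the SAS format; the account name, dates, thresholds, and the `review_sas_url` helper below are all hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def review_sas_url(url, now, max_days=7):
    """Return a list of concerns about one SAS URL (empty if none)."""
    params = parse_qs(urlsplit(url).query)
    perms = params.get("sp", [""])[0]
    expiry_raw = params.get("se", [""])[0]
    findings = []

    # Anything beyond read (r) and list (l) lets the holder change data.
    if set(perms) - set("rl"):
        findings.append(f"grants more than read/list: sp={perms}")

    if not expiry_raw:
        findings.append("no expiry (se) present")
    else:
        # Older Pythons do not parse a trailing 'Z', so normalize it.
        expiry = datetime.fromisoformat(expiry_raw.replace("Z", "+00:00"))
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)
        days_left = (expiry - now).days
        if days_left > max_days:
            findings.append(f"valid for {days_left} more days (limit {max_days})")
    return findings


# Hypothetical URLs; the signature is redacted on purpose.
now = datetime(2023, 9, 18, tzinfo=timezone.utc)
wide = ("https://example.blob.core.windows.net/container"
        "?sv=2020-08-04&sr=c&sp=racwdl&se=2051-10-06T07:00:00Z&sig=REDACTED")
narrow = ("https://example.blob.core.windows.net/container/file.bin"
          "?sv=2020-08-04&sr=b&sp=r&se=2023-09-20T00:00:00Z&sig=REDACTED")

for finding in review_sas_url(wide, now):
    print(finding)
```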


Did you read TFA? It does mention AI, and also mentions that this is less about AI and more about the fact that the AI researchers had a TON of data to share, and their method for doing so was poorly configured SAS tokens…

The article also mentions that these tokens cannot be tracked: they are issued on the client side (if I understood this correctly), which means that to audit tokens you’d have to ask everyone who had one issued to politely provide said token. Will everyone remember the tokens they have? Probably not. And if an attacker has already gotten what they needed, or managed to issue their own, no one would know.


AI data is highly centralized and not stored in a serially-accessed database, which makes it unusual inasmuch as 40TB of interesting data does not often get put into a single storage bucket.


It was so common that S3 added several features to make it really, really hard to accidentally leave a whole bucket public.

Looks like Azure hasn't done similarly.


Is there any valid use case for when it's a good idea to publicly expose a S3 bucket?


Sharing of datasets, disk images, ISOs, ML models, etc, as well as public websites.


It didn’t seem to be focused on AI except for the very reasonable concerns that AI research involves lots of data and often also people without much security experience. Seeing things like personal computer backups in the dump immediately suggests that this was a quasi-academic division with a lot less attention to traditional IT standards: I’d be shocked if a Windows engineer could commit a ton of personal data, passwords, API keys, etc. and first hear about it from an outside researcher.


Embrace, extend, and extinguish cybersecurity with AI. It's the Microsoft way.


At this point MS might as well acquire Wiz, given the number of Azure security findings they have found.


Would be cool if someone analysed - i am fairly certain it has proprietary code and data laying around. Would be useful for future lawsuits against microsoft and others that steal people’s ip for “training” purposes.


This is so unfortunate but a clear illustration of something I've been thinking about a lot when it comes to LLMs and AI. It seems like we're forgetting that we are just handing our data over to these companies on a silver platter in the form of our prompts. Disclosure that I do work for and we are working on a way to automatically redact any information you send to an LLM -


This stands out

> Our scan shows that this account contained 38TB of additional data — including Microsoft employees’ personal computer backups.

Not even Microsoft has functioning corporate IT any more, with employees not just being able to make their own image-based backups, but also having to store them in some random S3 bucket that they're using for work files.


Why not even?

Security was never a strong part of Microsoft.


Part of me thought "this is fine as very few could actually download 38TB".

But that's not true as it's just so cheap to spin up a machine and some storage on a Cloud provider and deal with it later.

It's also not true as I've got a 1Gbps internet connection and 112TB usable in my local NAS.

All of a sudden (over a decade) all the numbers got big and massive data exfiltration just looks to be trivial.

I mean, obviously that's the sales pitch... you need this vendor's monitoring and security, but that's not a bad sales pitch as you need to be able to imagine and think of the risk to monitor for it and most engineers aren't thinking that way.


Agree, this is extremely dubious:

5gbps and 10gbps residential fiber connections are common now.

12TB hd's cost under $100, so you would only need about $400 of storage to capture this, my SAN has more capacity than this and I bought basically the cheapest disks I could for it.

It only takes one person to download it and make a torrent for it to be spread arbitrarily.

People could target more interesting subsets over less interesting parts of the data.

Multiple downloaders could share what they have and let an interested party assemble what is then available.


Trivial in a technical sense but monitoring capabilities (hopefully) have increased in kind.


At the rack rates of $.05/GB, that’d come out to $1,945 per copy that’s downloaded. So not only do you have the breach, you also have a fat bill too.
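
The arithmetic behind that figure, as a quick sketch. It assumes 1 TB = 1024 GB, which is what makes the number come out near $1,945; with 1000 GB per TB it would be $1,900.

```python
def egress_cost_usd(size_tb, usd_per_gb, gb_per_tb=1024):
    """Cost of downloading size_tb once at a flat per-GB egress rate."""
    return size_tb * gb_per_tb * usd_per_gb


cost = egress_cost_usd(38, 0.05)
print(f"${cost:,.2f} per full copy")
```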


> $.05/GB

That's just a scam rate by AWS. The true price is 1/100th of that, if that.


with a 1Gbps connection you're still looking at ~248 hours to download, and that's if the remote server can keep up, which it almost certainly can't

this is assuming by 1Gbps you mean 1 Gigabit/s rather than 1 Gigabyte/s


But you don't need to download everything. Even 1/10th of that could be juicy enough. Or 1/100th.


Not sure where 248 hours came from.

38 terabytes = 304 terabits.

304 terabits / 1 gigabit/second = 304,000 seconds

304,000 seconds =~ 84 hours. Add 20% for not pegging the line the whole time and the limits of 1gbps ethernet, and perhaps 100 hours is reasonable.
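
The same calculation as a small sketch, using decimal units (1 TB = 10^12 bytes) and treating the 20% as a flat overhead factor. It also shows that plugging in 112 TB instead of 38 TB gives roughly 248 hours.

```python
def download_hours(size_tb, link_gbps, overhead=0.0):
    """Hours to transfer size_tb over a link_gbps line, plus overhead."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds * (1.0 + overhead) / 3600


print(f"38 TB at 1 Gbps:           {download_hours(38, 1):.1f} h")
print(f"38 TB at 1 Gbps, +20%:     {download_hours(38, 1, overhead=0.2):.1f} h")
print(f"112 TB at 1 Gbps:          {download_hours(112, 1):.1f} h")
```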


my mistake, I swapped the 38tb and 112tb from parent comment

whatever the download size is, you're bottlenecked by the remote server's up speed


If the "remote server" is Azure, the target throughput is 0.5gbps ... for each large blob (of which this leak includes many). It seems pretty likely you'll be able to download at a few gigabits per second if your local connectivity allows.


that's a big if


We're talking about exfiltrating data from incorrect permissions on Azure, so it's not an if. It's a given for the situation in the article that we're discussing in this thread.


Not really a sales pitch as it wasn't discovered by their product but rather by their security team doing a bunch of manual work.


How do you have your NAS configured? The more specifics, the better; I’ve wanted one.

Do you worry about failure? In your hardware life I mean, not your personal life.


I just have a Synology DS1821+ which has (8 * HDD bays) + (2 * M.2 slots). The bays I've filled with 18TB HDDs (I chose Toshiba N300 as they do not use SMR). The M.2 slots I've put a couple of 1TB M.2 drives in as an SSD cache (they better allow the HDDs to hibernate for frequently accessed files like music).

I've got these in an SHR configuration (Synology Hybrid RAID with 1 disk of protection), which means about 115-116TB of usable space and allows for a single drive failure.

The filesystem is BTRFS.

I upgraded the RAM (Synology will forever nag about it not being their RAM).

I have the option in future to purchase the network card to take that to 10Gbps ports rather than 1Gbps ports.

So that's the first... but then I have a second one... which is an older DS1817+ which is filled with 10TB HDDs and yields 54.5TB usable in SHR2 + BTRFS... which I use as a backup to the first, but as it's smaller just the really important stuff and it is disconnected and powered down mostly, it's a monthly chore to connect it, and rsync things over. Typically if I want to massively expand a NAS (every - 10 years) I will buy a whole new one and relegate the existing to be a backup device. Meaning an enclosure has on avg about 15y of life in it and amortises really well as being initially the primary, and then later the backup.
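For anyone wondering how 8 x 18TB turns into roughly 115TB: a rough sketch, assuming SHR with identical drives behaves like RAID 5 (SHR2 like RAID 6) for capacity purposes, that both units have all 8 bays populated, and ignoring filesystem overhead. Drives are sold in decimal TB while the NAS reports binary TiB. The function name is hypothetical.

```python
def usable_tib(bays: int, drive_tb: float, parity_disks: int) -> float:
    """Usable space in TiB for equal-size drives, before filesystem overhead."""
    usable_bytes = (bays - parity_disks) * drive_tb * 10**12  # decimal TB as sold
    return usable_bytes / 2**40                                # binary TiB as reported


print(round(usable_tib(8, 18, 1), 1))  # 8 x 18TB, one disk of protection
print(round(usable_tib(8, 10, 2), 1))  # 8 x 10TB, two disks of protection
```

Both results land within a fraction of a TiB of the usable sizes quoted above, so most of the gap between raw and usable space is just the unit conversion plus the protection disks.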

I do _not_ use any of the Synology software, it's just a file system... I prefer to keep my NAS simple and offload any compute to other small devices/machines. This is in part because of the length of time I keep these things in service... the software is nearly always the weakest link here.

You can build your own NAS; TrueNAS Core (née FreeNAS) is very good... but for me, a NAS is always on, and the low power performance of these purpose-built devices, their ability to handle environmental conditions (I am not doing anything special for cooling, etc) and the long-term updates to the OS, etc... make them quite compelling.


Not the OP but I have a pair of Chenbro NR12000 1U rack mount servers, bought for about $120 each on eBay a few years ago. Each has 12 internal 3.5" mounting points and 14 SATA cables. In one server, I have 12 4TB used enterprise drives. In the other, I have 12 8TB drives. Both have 16 GB of RAM (should probably be more) and two 2.5" SATA SSDs. They are configured with two ZFS raidz1 vdevs, each made up of 6 disks. This gives me 10 usable disks and 2 used for parity, and the ability to survive at least one failure but maybe two (if I'm lucky).

I back up critical data from the 80TB NAS to the 40TB NAS, and the most critical data gets backed up nightly to a single hard drive in my friend's NAS box (offsite). Twice a year, I back up the full thing to external hard drives and take them out of state to a different friend's house.

Don't worry, be happy.


What are your criteria for used enterprise drives? I'm wading into building a NAS (well.. it's more of a 'project' NAS as an above comment would say) and I'm getting a little lost in the sauce about drives.


I just bought the cheapest "Grade A" drives I could find from eBay. This is not the reliable way to do it, but as I have a 3 layer backup solution anyway, I don't really mind the risk of a drive failure.

It depends on what your plans for the storage are. If you're going to fill it with bulk data that gets accessed sequentially (think media files), then performance will be fine with basically any topology or drive choice. If you are going to fill it with data for training ML models across multiple machines, you need to think about how you will make it not the bottleneck for your setup.

One more thing to consider - you can get new consumer OR used enterprise flash for somewhere around $45/TB in the 4 TB SATA size, or the 8 TB NVMe size. Those drives will likely fail read-only if they fail at all. They will usually use less power, take less space, and obviously will perform orders of magnitude better than spinning rust, at somewhere around 3x the cost.

I am hoping to build my next NAS entirely on flash.


(Where are you finding friends with a NAS? Or at all, for that matter… guess I’ll look on eBay.)

Thank you for the details, particularly about zfs, which I know nothing about. The “if I’m lucky” part piqued my interest. HN was recently taken down by a double disk failure, which is exponentially more likely when you buy drives in bulk - the default case. So being able to survive two failures simultaneously is something I’d like to design for.

It’s cool you have two NASes (NASen?) let alone one. They’re the Pokémon of the tech world.


Interesting. It's been a while since I've used eBay, but man they've really upped their game if you can buy friends there now.


OP was pulling your leg a bit. Clearly the only friends folks like us have with NAS are the friends here on HN posting about their NAS.


Ah my tech friends have specialized into hardware a bit. At least two of us have server racks in our basement, and basically nobody I know (who at least knows the command line) does not have at least a few drives in an old Linux server somewhere.

If you are concerned about reliability above performance, I would suggest using a single raidz2 vdev instead. This would allow the cluster to definitely survive two disks worth of failure. I'll also echo the common mantra - RAID is not backups. If you really need the data, you need to store a second copy offline in a different place.

When I lived in California and did not have room for a server rack, I had a single home server with an 8-bay tower case. I used an LSI card with 2 SAS-to-4x-SATA ports to connect all 8 drives to the machine. I believe I had 6 TB drives in that NAS, though they are currently all out of my house (part of one of my offsite backups now). My topology there was 4x mirror vdevs, which gave me worst case endurance of 1 failure but best case of 4 failures, and at about 4x the IOPS performance, but with the cost of only 50% storage efficiency vs the 75% you would get with raidz2.

There is even raidz3 if you are very paranoid, which allows up to 3 disks to fail before you lose the vdev. I've never used it. As I understand, the parity calculations get considerably more complicated, although I don't know if that really matters.
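To compare the layouts mentioned in this thread side by side, here is a minimal sketch. It only counts disks: it ignores ZFS metadata, padding, and reserved space, and the function names are made up for illustration.

```python
def pool_efficiency(vdevs: int, disks_per_vdev: int, redundant_per_vdev: int) -> float:
    """Fraction of raw disks that hold data rather than parity or mirror copies."""
    data_disks = vdevs * (disks_per_vdev - redundant_per_vdev)
    return data_disks / (vdevs * disks_per_vdev)


def failure_tolerance(vdevs: int, redundant_per_vdev: int) -> tuple[int, int]:
    """(worst case, best case) number of disk failures the pool survives."""
    # Worst case: every failure hits the same vdev. Best case: failures spread evenly.
    return redundant_per_vdev, vdevs * redundant_per_vdev


layouts = {
    "4x 2-disk mirrors": (4, 2, 1),
    "1x 8-disk raidz2": (1, 8, 2),
    "2x 6-disk raidz1": (2, 6, 1),
}
for name, (vdevs, disks, redundant) in layouts.items():
    eff = pool_efficiency(vdevs, disks, redundant)
    worst, best = failure_tolerance(vdevs, redundant)
    print(f"{name}: {eff:.0%} efficient, survives {worst} to {best} failures")
```

The mirror layout trades capacity for IOPS and a lucky best case, while a single raidz2 vdev guarantees two failures at 75% efficiency.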


Not the original poster, but to add my experience:

Two-bay NAS, two drives as a mirrored pair, two SSDs as mirrored pair cache. Only makes data available on my home network. Primarily using Nextcloud and Gitea.

It backs up important files nightly to a USB-attached drive, less critical files weekly. I have a weekly backup to a cloud provider for critical files.

A sibling comment makes a good point: do you want a hobby or an appliance? Using a commercial NAS makes it closer to an appliance[0]. Building it yourself will likely require more fiddling.

If you want to run a different OS on a commercial NAS, dig deeper into the OS requirements before buying the NAS. Asustor Lockerstor Gen 2 series' fan is not inherently supported by things other than Asustor's software.

[0] A commercial NAS will still require monitoring, maintenance, and validation of backups.


Unraid is a pretty friendly OS with easy disk adoption and nice gui for managing docker containers.

You can have up to two disks of redundancy (dual parity) per drive pool.


Not the OP, but after a lot of messing with software and OS RAID, RAID cards and motherboards, dedicated loud Dell servers, UnRAID, this that and the other thing over years and decades, I just set up a big Synology device 5 years ago. Since then, I've had a NAS that just worked. I have data, it's there.

I do online backup to a cloud provider, and a monthly dump to external USB drives that I keep and rotate at my mother in law's house (off site:).

More than any technical advice, I'd strongly urge you to check and understand honestly whether you're looking for "NAS" (a place to seamlessly store data) or "a project" (something to spend fun and frustrating and exciting evening and weekend time configuring, upgrading, troubleshooting, changing, re-designing, replacing, blogging, etc). Nothing wrong with either, just ensure you pick the path you actually want :->


Which model Synology do you have? (Would you still make the same choice today?)

Did you settle on using RAID, or just rely on cloud backups?


I have the DS918+

I would not make the same choices today: I got a somewhat high end one and upgraded it to a whopping 32GB of RAM, thinking I'd use it for running lightweight containers or VMs, and maybe a media server. But once I put all my data on it... including 20 years of family photos and tax prep documents and work stuff and everything else... I changed my mind and am using it only and solely as an internal storage unit. Basically, as mentioned, committed to the "NAS" as opposed to "Fun Project" path :-). So I could've saved myself some money by getting a simpler unit and not upgrading it. (the DS918+ also can hook up to a cage [DX517], but I ended up not needing that either, yet).

I have it with 4 WD Red Plus NAS 8TB drives and RAID 10 currently. I've used RAID 5 in the past but decided against it for this usage - again, went for simplicity.

Just shy of 30,000 hours on the drives, daily usage (I basically don't use local drive for any data on any of my computers; I keep it all on NAS and this way I can use any of my computers to do/access the same thing), and really no issues whatsoever so far.


I use an Ubuntu Raspberry Pi with a cheap USB3 JBOD array from Amazon that can hold 5 HDDs. I use zfs on it in raidz1. It’s absurdly cheap, can serve about 80 Mb/s on a 1 gbps link, and is entirely sufficient for local backup. I don’t do any offsite. Set up to back up time machine, windows, and zrepl. Runs other services on the pi as well for the home network.

It’s so easy to set up an Ubuntu image that I control completely, and I would rather do that than run some questionable 3rd party NAS solution. Excluding disks, it costs about $130.


It's much worse - if the data isn't just a ton of tiny files, and you're able to spin up a bunch of workers for parallelism, you can get up to 120 Gbps per storage account (without going to the extreme of requiring a special quota increase).

That means the data could have been downloaded by someone in well under an hour. Even most well-run security teams won't be able to respond quickly enough for that type of event.
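As a rough check, assuming the 38 TB total mentioned upthread, decimal units, and a sustained rate with no throttling: at 120 gigabits per second the full download takes about 42 minutes, and it would take about 5 minutes only at 120 gigabytes per second. Either way the window is short. The function below is an illustrative sketch, not anything from Azure's tooling.

```python
def transfer_minutes(size_tb: float, rate: float, rate_in_bytes: bool = False) -> float:
    """Minutes to move size_tb terabytes at `rate` gigabits/s (or gigabytes/s)."""
    gigabytes = size_tb * 1000                         # decimal: 1 TB = 1000 GB
    amount = gigabytes if rate_in_bytes else gigabytes * 8
    return amount / rate / 60


print(round(transfer_minutes(38, 120), 1))                      # 120 Gbps
print(round(transfer_minutes(38, 120, rate_in_bytes=True), 1))  # 120 GB/s
```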


The article mentions that it wasn't a read-only token, meaning you could at least edit and delete files too.