http://dtrace.org/blogs/dap/comments/feed/

Comments for Dave Pacheco's Blog



Systems software engineering



Last Build Date: Sun, 28 Aug 2016 17:21:43 +0000

 



Comment on TCP puzzlers by Нуга и ненужность — Episode 0107 « DevZen Podcast

Sun, 28 Aug 2016 17:21:43 +0000

[...] TCP puzzlers | Dave Pacheco’s Blog [...]



Comment on Fault tolerance in Manta by dap

Thu, 18 Jul 2013 15:59:45 +0000

Alfonso,

Yes, if there's a partition that affects the storage node for an ongoing PUT operation, that operation will completely fail and you'll have to retry it from the client. The metadata update (i.e., linking /$MANTA_USER/stor/foo to the newly stored object) is only done once the data is stored in the backend, so that step will never happen in this case. It will be like the request never happened.

The PUT operation is atomic across all instances in a single shard. (We don't support server-side operations that would modify multiple shards.) You can do as many operations as you want concurrently, with the usual semantics: exactly one of them will happen "last" and become the final state. You can use HTTP conditional PUT (using etags) to avoid multiple clients stomping on each other.

Hope that helps! -- Dave
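A minimal sketch of the conditional-PUT pattern Dave mentions, in Python with the requests library; the endpoint, auth header, and object path are placeholders for illustration, not the documented Manta API (If-Match/etag and the 412 response are standard HTTP semantics):

    import requests

    BASE = "https://manta.example.com"          # placeholder endpoint
    PATH = "/$MANTA_USER/stor/foo"              # object path from the comment above
    AUTH = {"Authorization": "Signature ..."}   # real Manta request signing omitted

    # Learn the current etag of the object we intend to replace.
    resp = requests.get(BASE + PATH, headers=AUTH)
    etag = resp.headers["Etag"]

    # Replace the object only if nobody else has replaced it since we read it.
    put = requests.put(BASE + PATH,
                       data=b"new object contents",
                       headers={**AUTH, "If-Match": etag})
    if put.status_code == 412:
        # Precondition failed: another client's PUT won the race.
        print("object changed underneath us; re-read and retry (or give up)")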



Comment on Fault tolerance in Manta by Alfonso

Thu, 18 Jul 2013 08:25:01 +0000

Distributed systems are always complex due to the different points of failure ;). Glad to hear you use Zookeeper. So if, during a PUT on an object, a network partition happens where the copy is first overwritten, so that AZ is disconnected from the rest, the operation will fail and I would have to try again to update one of the other replicas. When the network partition is fixed, then Zookeeper plays its role, I presume. Is the "PUT" operation atomic across all the instances? That is, if an instance is updated, can I not do another "PUT" until all the replicas are replaced? Or is there an ordering in the operations, so that if a new PUT comes in before all the replicas are updated, the later PUT, being a new version of the object, overrides the running one? I know that this will not be the "normal" use case for Manta. Thanks Dave.



Comment on Fault tolerance in Manta by dap

Wed, 17 Jul 2013 13:59:47 +0000

Alfonso,

Remember that Manta only supports replacing an existing object with a different one, not updating some part of the original object. Replacing an object is just a PUT of a new object with the same name.

In the case of a partition, the partitioned AZ will generally be completely removed from service: its frontend loadbalancers will be removed from DNS, and internal services in the other AZs will stop trying to use the services they cannot reach. As a result, incoming requests will hit the non-partitioned AZs and succeed because the metadata tier maintains write availability as long as a majority of instances for each shard are available. (As I alluded to, the way it actually does this is complex. We're using Zookeeper to figure out what "majority" means, and we wrote software to update the read-write state of the database in response to changes in the topology.) In the unlikely event that a request does hit the partitioned AZ, the request would get a 503 because the metadata tier refuses updates when there's only one instance of a shard available.

So there's nothing to resolve: objects are immutable, and changes to the metadata tier are ACID. Does that answer it?

Thanks, Dave
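Since this failure mode surfaces to clients as a 503, the natural client-side counterpart is a retry loop; a small sketch in Python (illustrative only, not part of any Manta SDK):

    import time
    import requests

    def put_with_retry(url, data, headers, attempts=5):
        # Retry on 503 on the assumption that a later attempt will be routed
        # to a non-partitioned AZ, or that the partition will have healed.
        resp = None
        for i in range(attempts):
            resp = requests.put(url, data=data, headers=headers)
            if resp.status_code != 503:
                return resp
            time.sleep(2 ** i)      # simple exponential backoff: 1s, 2s, 4s, ...
        return resp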



Comment on Fault tolerance in Manta by Alfonso

Wed, 17 Jul 2013 07:53:25 +0000

How do you deal with 2 updates on different replicas of the same object? Imagine a network partition. What happens if the object gets 2 different updates, and then the network partition is healed? Do you have any distributed locking mechanism, so that we can guarantee that the "transaction" over an object is ACID?



Comment on Fault tolerance in Manta by Mark Cavage

Thu, 04 Jul 2013 15:31:58 +0000

Hi Andy, Yes, good question. Right now we don't rebalance data for any reason besides failure. What we've talked about for this is actually giving customers some sample jobs (and perhaps an extra magical hook) such that you can run a Manta compute job on your data to do any of rebalance/(increase|decrease) durability/reconstitute etc, which is in line with your closing thought on Joyent delegating to the customer. Combined with that, we've also talked about letting you specify a compute profile when you write data (i.e., you could say you wanted to write to a heavy CPU skew box, or SSD, or ...); pricing incentives would dictate to users how they want to write and store. The honest answer here is there's a lot we *can* do, but we're sort of waiting and seeing what we should/will do from feedback like this question and what we see from actual usage data. m



Comment on Fault tolerance in Manta by Mark Cavage

Thu, 04 Jul 2013 15:20:57 +0000

Hello Tushar Jain,

Yes, your understanding is correct. We have debated whether to just set a "floor of 2" for the number of AZs, or min(num_azs, num_copies) -- right now it's the latter, as we believe that's the "intended semantics" of customers when setting number of copies > 2, and since total AZ failure is (very) uncommon. If this is a frequent request or problem, I think we'd add a small feature to explicitly let you control the minimum number of AZs, instead of us inferring it.

Lastly, there's no mention thus far of the durability guarantee (more on that in a future post), but note that 2 copies in Manta is price competitive with other object stores and offers an astronomically high durability level. That doesn't change your questions, or the answers, but it seems worth pointing out that most customers don't need/want > 2 copies in Manta.

m
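To make the current placement rule concrete, here is a one-function Python rendering of min(num_azs, num_copies) as described above (illustrative only):

    def azs_used(num_azs, num_copies):
        # Copies are spread across as many distinct AZs as possible,
        # i.e. min(num_azs, num_copies) under the current rule.
        return min(num_azs, num_copies)

    # With 3 AZs today:
    assert azs_used(3, 2) == 2   # 2 copies span 2 AZs
    assert azs_used(3, 3) == 3   # 3 copies span all 3 AZs, so losing one AZ blocks a 3-copy PUT
    assert azs_used(3, 4) == 3   # additional copies double up within the 3 AZs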



Comment on Fault tolerance in Manta by Andrew Leonard

Thu, 04 Jul 2013 14:17:10 +0000

I understand the decision not to automatically re-replicate in the event of a failure - that seems reasonable. I am curious about how you plan to rebalance data for performance reasons - will that be done manually or automatically? The dilemma I see is that manual rebalancing won't scale, while automatic balancing would likely introduce new complex failure modes. (I'm assuming that across your fleet of storage nodes, some will contain hotter data than others, perhaps to the point that some are continually congested with compute jobs, while others sit idle. My experience is that old data is generally colder than new data, which could result in low average compute utilization - without rebalancing, newly deployed systems will be the busiest, while older systems grow increasingly idle.) Perhaps rebalancing could be delegated from Joyent operations to the customer - maybe a metric could be exposed to the user when an object could benefit from being rebalanced, and the customer could request it themselves if desired - or a property could be added within job configuration files along the lines of "rebalance my data if it will get my job done more quickly." -Andy



Comment on Fault tolerance in Manta by Tushar Jain

Wed, 03 Jul 2013 20:51:25 +0000

Great answers guys. Thanks. One more follow-up. If my understanding is correct, then it seems that currently, if I set durability level to 3 and a single AZ is unavailable, then all my PUTs will 500. I'm reaching this conclusion based on the following points you guys mentioned:

* All N writes have to succeed for me to get a 200
* Manta is currently in 3 AZs and the first 3 copies are stored in separate availability zones

So, am I correct in concluding that if I set my durability level to 3, my PUTs cannot withstand a single AZ failure or inter-AZ partition? GETs should still succeed since you'll be able to fetch the data from one of the other AZs. Thanks!



Comment on Fault tolerance in Manta by Mark Cavage

Wed, 03 Jul 2013 20:32:35 +0000

Hello Tushar Jain,

(1) We return 200 only upon both copies being durably on disk, and the metadata being durably on disk about where your object is. That is, once you get a 200, you've got $DurabilityLevel guaranteed, and the ability to read it. So yes, all N writes have to succeed for you to get a 200. As for read-after-write-on-error: if you got a 500 on a PUT, we now have some set of "orphan data" on storage nodes, but we won't have stashed a metadata record for it, so a subsequent GET will return 404 (or the previous version of the object if it was an overwrite). We asynchronously clean that up, but it's not your problem.

(2) As Dave said, we don't automatically re-replicate, but we will do so if we can't recover the host by operator means (i.e., it's far less invasive to replace a blown PDU, for example, than to rewrite all the data on a dense storage box). Your data will be returned to the advertised durability level "as fast as we can". The only exception to this is if you set durability level to one and the storage pool is actually toast, in which case you would have data loss.
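A rough pseudocode sketch of the ordering described in (1): data is made durable first, and the metadata record is written only after every copy succeeds. The storage_nodes and metadata_tier objects here are made up for illustration, not the real Manta implementation:

    def put_object(name, data, copies, storage_nodes, metadata_tier):
        written = []
        for node in storage_nodes[:copies]:
            if not node.write_durably(name, data):    # hypothetical storage-node call
                # One copy failed: the client sees a 500. Copies already written
                # become "orphan data" to be garbage-collected asynchronously; no
                # metadata record exists, so a GET still returns 404 (or the old object).
                return 500
            written.append(node)
        # Only after all copies are durable does the object become visible to GETs.
        metadata_tier.record(name, [n.id for n in written])   # hypothetical metadata call
        return 200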