ReplicaDrop
Move Type: Advanced Complexity: O(G × R) where G = groups, R = replicas per group Primary Use: Intelligent replica reduction when over-replicated
Intelligently select which replicas to drop when reducing replication levels. Compares all replicas in a group and chooses the best ones to remove.
Overview
ReplicaDrop (also known as REPLICA_DROP) is a specialized move type for scenarios where you have more replicas than needed and must decide which ones to drop. Instead of randomly removing replicas, it:
- Identifies all replicas in the same group (e.g., all tasks in a job)
- Evaluates moving each replica out of the specified scope
- Compares the objective for each option
- Selects the best replica to drop based on objectives
This ensures that when reducing replication, you keep the best replicas and drop the worst ones.
Use when:
- Over-replicated shards/jobs need reduction
- Must intelligently choose which replicas to drop
- Have objectives that distinguish replica quality (lag, load, etc.)
- Downsizing while maintaining quality
Avoid when:
- No over-replication
- All replicas are equivalent
- Don't have quality metrics to compare
- Need to add replicas (not remove them)
Quick Example
- Python
- C++
# Reduce job from 4 tasks to 2, dropping worst tasks
solver.addConstraint(
ConstraintSpecs(
groupCountSpec=GroupCountSpec(
scope="assigned",
dimension="task_count",
partitionName="job",
bound=GroupCountSpecBound.EXACT,
limit=GroupLimitSpec(defaultLimit=2),
)
)
)
solver.addSolver(
SolverSpecs(
localSearchSolverSpec=LocalSearchSolverSpec(
moveTypeList=[
ReplicaDropMoveTypeSpec(
replicaDropPartition="job",
replicaDropScope="assigned",
),
]
)
)
)
void quickExample() {
// Reduce job from 4 tasks to 2, dropping worst tasks
auto executor = std::make_shared<folly::CPUThreadPoolExecutor>(4);
ProblemSolver solver(executor, "example", "test");
solver.setObjectName("task");
solver.setContainerName("host");
// Setup assignment so tasks exist before referencing them in partition
solver.setAssignment(
std::map<std::string, std::vector<std::string>>{
{"host1", {"task0", "task1"}},
{"host2", {"task2"}},
{"host3", {"task3"}},
});
// Define jobs with tasks
std::unordered_map<std::string, std::vector<std::string>> jobs = {
{"job0", {"task0", "task1", "task2", "task3"}},
};
solver.addPartition("job", jobs);
// Define assigned scope
std::map<std::string, std::string> assignedScope = {
{"host1", "assigned"},
{"host2", "assigned"},
{"host3", "assigned"},
};
solver.addScope("assigned", assignedScope);
LocalSearchSolverSpec solverSpec;
ReplicaDropMoveTypeSpec replicaDropSpec;
replicaDropSpec.replicaDropPartition() = "job";
replicaDropSpec.replicaDropScope() = "assigned";
solverSpec.moveTypeList()->emplace_back();
solverSpec.moveTypeList()->back().replicaDropMoveTypeSpec() = replicaDropSpec;
solver.addSolver(solverSpec);
}
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
replicaDropPartition | string | Yes | null | Partition defining replica groups |
replicaDropScope | string | Yes | null | Scope to move replicas OUT OF |
Parameter Details
replicaDropPartition:
- The partition name defining replica groups
- Example: "job" partition groups tasks by job
- Example: "shard" partition groups replicas by shard
- Each group represents a set of replicas to compare
replicaDropScope:
- The scope from which replicas should be moved OUT
- Example: "assigned" scope (move to outside, e.g., "unassigned")
- Replicas are moved from inside this scope to containers outside it
- Typically used with special unassigned/dropped container
How It Works
For each hot container in the scope:
- Identify objects: Find objects in hot container within the drop scope
- Group by partition: Determine which replica group each object belongs to
- For each group with object in hot container:
- Evaluate all replicas: Try moving each replica in the group out of scope
- Compare objectives: Calculate objective change for each replica drop
- Select best: Choose replica whose removal improves objective most
- Return best move: Return the single best replica drop found
Visual Example
Job "job0" has 4 tasks (over-replicated, need only 2):
- task0: lag = 0.00 (good)
- task1: lag = 0.01 (okay)
- task2: lag = 0.03 (bad)
- task3: lag = 0.02 (not great)
Goal: Minimize total lag, keep exactly 2 tasks assigned
ReplicaDrop evaluation:
Option 1: Drop task0 → Remaining lag: 0.01 + 0.03 + 0.02 = 0.06
Option 2: Drop task1 → Remaining lag: 0.00 + 0.03 + 0.02 = 0.05
Option 3: Drop task2 → Remaining lag: 0.00 + 0.01 + 0.02 = 0.03 ✓ Best!
Option 4: Drop task3 → Remaining lag: 0.00 + 0.01 + 0.03 = 0.04
Choose to drop task2 (highest lag)
After 2 drops:
Keeps: task0 (lag 0.00), task1 (lag 0.01)
Dropped: task2 (lag 0.03), task3 (lag 0.02)
Complexity
Per iteration: O(G × R)
Where:
- G = number of groups with objects in hot container
- R = average replicas per group
Example calculation:
- Groups in hot container: 10
- Replicas per group: 4
- Evaluations: 10 × 4 = 40
Typical scenario:
- 100 shards (groups)
- 5 replicas per shard
- Want to reduce to 3 replicas per shard
- Iterations needed: 2 drops per shard × 100 shards = 200 iterations
- Evaluations per iteration: ~5 (compare all replicas)
- Total: ~1,000 evaluations
Usage Patterns
Job Downsizing
Reduce job task count while minimizing lag:
- Python
- C++
# Downsize jobs to specific task counts, dropping highest-lag tasks
solver = ProblemSolver(service_name="example", service_scope="test")
solver.setObjectName("task")
solver.setContainerName("host")
# Define jobs
solver.addPartition(
"job",
{
"job0": ["job0_task0", "job0_task1", "job0_task2", "job0_task3"],
"job1": ["job1_task0", "job1_task1"],
},
)
# Define assigned scope
solver.addScope(
"assigned",
{
"host1": "assigned",
"host2": "assigned",
"host3": "assigned",
"host4": "assigned",
},
)
# Add lag dimension to distinguish task quality
solver.addObjectDimension(
"lag",
{
"job0_task0": 0.0,
"job0_task1": 0.01,
"job0_task2": 0.03,
"job0_task3": 0.02,
"job1_task0": 0.06,
"job1_task1": 0.0,
},
)
solver.addConstraint(
ConstraintSpecs(
groupCountSpec=GroupCountSpec(
scope="assigned",
dimension="task_count",
partitionName="job",
bound=GroupCountSpecBound.EXACT,
limit=GroupLimitSpec(groupLimits={"job0": 2, "job1": 1}),
)
)
)
solver.addSolver(
SolverSpecs(
localSearchSolverSpec=LocalSearchSolverSpec(
moveTypeList=[
ReplicaDropMoveTypeSpec(
replicaDropPartition="job",
replicaDropScope="assigned",
),
]
)
)
)
void jobDownsizing() {
// Downsize jobs to specific task counts, dropping highest-lag tasks
auto executor = std::make_shared<folly::CPUThreadPoolExecutor>(4);
ProblemSolver solver(executor, "example", "test");
solver.setObjectName("task");
solver.setContainerName("host");
// Define jobs
std::unordered_map<std::string, std::vector<std::string>> jobs = {
{"job0", {"job0_task0", "job0_task1", "job0_task2", "job0_task3"}},
{"job1", {"job1_task0", "job1_task1"}},
};
solver.addPartition("job", std::move(jobs));
// Define assigned scope
std::map<std::string, std::string> assignedScope = {
{"host1", "assigned"},
{"host2", "assigned"},
{"host3", "assigned"},
{"host4", "assigned"},
};
solver.addScope("assigned", assignedScope);
// Add lag dimension to distinguish task quality
std::map<std::string, double> lagDimension = {
{"job0_task0", 0.0},
{"job0_task1", 0.01},
{"job0_task2", 0.03},
{"job0_task3", 0.02},
{"job1_task0", 0.06},
{"job1_task1", 0.0},
};
solver.addObjectDimension("lag", std::move(lagDimension));
GroupCountSpec groupCountSpec;
groupCountSpec.scope() = "assigned";
groupCountSpec.dimension() = "task_count";
groupCountSpec.partitionName() = "job";
groupCountSpec.bound() = GroupCountSpecBound::EXACT;
groupCountSpec.limit()->groupLimits() = {{"job0", 2}, {"job1", 1}};
solver.addConstraint(std::move(groupCountSpec));
LocalSearchSolverSpec solverSpec;
ReplicaDropMoveTypeSpec replicaDropSpec;
replicaDropSpec.replicaDropPartition() = "job";
replicaDropSpec.replicaDropScope() = "assigned";
solverSpec.moveTypeList()->emplace_back();
solverSpec.moveTypeList()->back().replicaDropMoveTypeSpec() = replicaDropSpec;
solver.addSolver(solverSpec);
}
Shard Replica Reduction
Reduce shard replication from 5x to 3x:
- Python
- C++
# Reduce from 5 replicas to 3 replicas per shard
solver.addConstraint(
ConstraintSpecs(
groupCountSpec=GroupCountSpec(
scope="active",
dimension="replica_count",
partitionName="shard",
bound=GroupCountSpecBound.EXACT,
limit=GroupLimitSpec(defaultLimit=3), # 3 replicas per shard
)
)
)
solver.addSolver(
SolverSpecs(
localSearchSolverSpec=LocalSearchSolverSpec(
moveTypeList=[
ReplicaDropMoveTypeSpec(
replicaDropPartition="shard",
replicaDropScope="active",
),
]
)
)
)
void shardReduction() {
// Reduce from 5 replicas to 3 replicas per shard
auto executor = std::make_shared<folly::CPUThreadPoolExecutor>(4);
ProblemSolver solver(executor, "example", "test");
solver.setObjectName("replica");
solver.setContainerName("node");
// Define shards with 5 replicas each
std::unordered_map<std::string, std::vector<std::string>> shards = {
{"shard1",
{"shard1_r1", "shard1_r2", "shard1_r3", "shard1_r4", "shard1_r5"}},
{"shard2",
{"shard2_r1", "shard2_r2", "shard2_r3", "shard2_r4", "shard2_r5"}},
};
solver.addPartition("shard", std::move(shards));
// Define active scope
std::map<std::string, std::string> activeScope = {
{"node1", "active"},
{"node2", "active"},
{"node3", "active"},
{"node4", "active"},
{"node5", "active"},
};
solver.addScope("active", activeScope);
GroupCountSpec groupCountSpec;
groupCountSpec.scope() = "active";
groupCountSpec.dimension() = "replica_count";
groupCountSpec.partitionName() = "shard";
groupCountSpec.bound() = GroupCountSpecBound::EXACT;
groupCountSpec.limit()->globalLimit() = 3; // 3 replicas per shard
solver.addConstraint(std::move(groupCountSpec));
LocalSearchSolverSpec solverSpec;
ReplicaDropMoveTypeSpec replicaDropSpec;
replicaDropSpec.replicaDropPartition() = "shard";
replicaDropSpec.replicaDropScope() = "active";
solverSpec.moveTypeList()->emplace_back();
solverSpec.moveTypeList()->back().replicaDropMoveTypeSpec() = replicaDropSpec;
solver.addSolver(solverSpec);
}
Quality-Based Selection
Drop replicas based on quality metrics:
- Python
- C++
# Keep 2 freshest replicas, drop stale ones
solver = ProblemSolver(service_name="example", service_scope="test")
solver.setObjectName("replica")
solver.setContainerName("server")
# Define replicas
solver.addPartition(
"data", {"dataset1": ["replica1", "replica2", "replica3", "replica4"]}
)
# Define online scope
solver.addScope(
"online",
{
"server1": "online",
"server2": "online",
"server3": "online",
},
)
# Add staleness dimension (lower is better)
solver.addObjectDimension(
"staleness",
{
"replica1": 0.0, # Fresh
"replica2": 5.0, # Slightly stale
"replica3": 20.0, # Very stale
"replica4": 2.0, # Mostly fresh
},
)
solver.addConstraint(
ConstraintSpecs(
groupCountSpec=GroupCountSpec(
scope="online",
dimension="replica_count",
partitionName="data",
bound=GroupCountSpecBound.EXACT,
limit=GroupLimitSpec(defaultLimit=2),
)
)
)
solver.addSolver(
SolverSpecs(
localSearchSolverSpec=LocalSearchSolverSpec(
moveTypeList=[
ReplicaDropMoveTypeSpec(
replicaDropPartition="data",
replicaDropScope="online",
),
]
)
)
)
void qualityBased() {
// Keep 2 freshest replicas, drop stale ones
auto executor = std::make_shared<folly::CPUThreadPoolExecutor>(4);
ProblemSolver solver(executor, "example", "test");
solver.setObjectName("replica");
solver.setContainerName("server");
// Define replicas
std::unordered_map<std::string, std::vector<std::string>> data = {
{"dataset1", {"replica1", "replica2", "replica3", "replica4"}},
};
solver.addPartition("data", std::move(data));
// Define online scope
std::map<std::string, std::string> onlineScope = {
{"server1", "online"},
{"server2", "online"},
{"server3", "online"},
};
solver.addScope("online", onlineScope);
// Add staleness dimension (lower is better)
std::map<std::string, double> stalenessDimension = {
{"replica1", 0.0}, // Fresh
{"replica2", 5.0}, // Slightly stale
{"replica3", 20.0}, // Very stale
{"replica4", 2.0}, // Mostly fresh
};
solver.addObjectDimension("staleness", std::move(stalenessDimension));
GroupCountSpec groupCountSpec;
groupCountSpec.scope() = "online";
groupCountSpec.dimension() = "replica_count";
groupCountSpec.partitionName() = "data";
groupCountSpec.bound() = GroupCountSpecBound::EXACT;
groupCountSpec.limit()->globalLimit() = 2;
solver.addConstraint(std::move(groupCountSpec));
LocalSearchSolverSpec solverSpec;
ReplicaDropMoveTypeSpec replicaDropSpec;
replicaDropSpec.replicaDropPartition() = "data";
replicaDropSpec.replicaDropScope() = "online";
solverSpec.moveTypeList()->emplace_back();
solverSpec.moveTypeList()->back().replicaDropMoveTypeSpec() = replicaDropSpec;
solver.addSolver(solverSpec);
}
Performance Characteristics
When Does It Help?
ReplicaDrop helps when:
- Over-replication: More replicas than needed
- Intelligent selection: Quality/metrics distinguish replicas
- Downsizing: Reducing replication levels
- Cost optimization: Removing least valuable replicas
- Resource constraints: Must reduce replica count
ReplicaDrop does NOT help when:
- Not over-replicated: Have correct number of replicas
- All replicas equivalent: No basis to choose which to drop
- Adding replicas: Need to increase replication (not decrease)
- No quality metrics: Can't distinguish good from bad replicas
- Random drops acceptable: Don't need intelligent selection
Comparison with Manual Approaches
| Approach | Selection | Quality | Use Case |
|---|---|---|---|
| Random drop | Random | Variable | No quality metric |
| Round-robin drop | Deterministic | Variable | Equal replicas |
| ReplicaDrop | Intelligent | Optimal | Quality-aware downsizing |
Comparison with Alternatives
| Move Type | Purpose | Replica Awareness | Use Case |
|---|---|---|---|
| Single | General moves | No | Standard optimization |
| FixedSource | Move from specific | No | Targeted moves |
| ReplicaDrop | Reduce replicas | Yes | Intelligent downsizing |
Troubleshooting
Problem: No moves found
Diagnosis: May not have over-replication or all replicas outside scope
Solutions:
- Check replica counts per group
- Verify replicas are in the drop scope
- Ensure some replicas can be moved out
- Review constraints blocking moves
Problem: Wrong replicas being dropped
Diagnosis: Objectives not reflecting replica quality
Solutions:
- Add dimension capturing replica quality (lag, staleness, etc.)
- Add goal minimizing that dimension
- Check goal weights and priorities
- Verify dimension values are correct
Problem: Drops too slow
Diagnosis: Many groups or large groups
Solutions:
- Check number of groups and replicas per group
- May need to batch drops
- Consider if all groups need reduction simultaneously
- Review time limits
Problem: Constraint violations
Diagnosis: Dropping replicas violates constraints
Solutions:
- Check GroupCount constraints are correct
- Verify capacity constraints allow reduction
- Review anti-affinity or colocation constraints
- May need to adjust constraints for downsizing
When to Use ReplicaDrop
DO use when:
- Reducing replication levels
- Have quality metrics for replicas
- Need intelligent replica selection
- Over-replicated and must downsize
- Want to keep best replicas
DO NOT use when:
- Not over-replicated
- All replicas are equivalent
- No quality metrics available
- Increasing replication (not decreasing)
- Random selection is acceptable
Related Move Types
Complementary move types:
- Single - General optimization after downsizing
- FixedSource - Targeted moves from specific sources
Related concepts:
- GroupCount constraint - Control replica counts
- Scope-based placement - Define where replicas can go
Source Code
- Thrift definition:
interface/thrift/SolverSpecs.thrift:678 - Implementation:
solver/moves/ReplicaDropMoveType.h - Tests:
interface/tests/ReplicaDropTest.cpp
Next Steps
- Learn about
GroupCountconstraint for controlling replica counts - Try Single for general optimization
- Review Move Types Overview for choosing move types
Notes
⚠️ Requires Quality Metrics: ReplicaDrop only helps if you have objectives/dimensions that distinguish replica quality. Without quality metrics, all replicas look the same and selection is arbitrary.
💡 Typical Pattern: Use with GroupCount constraint specifying target replica counts, and goals minimizing quality dimensions (lag, staleness, load, etc.).