Title: | R6sagemaker debugger for sagemaker operations |
---|---|
Description: | `R6sagemaker` debugger for sagemaker operations. |
Authors: | Dyfan Jones [aut, cre], Amazon.com, Inc. [cph] |
Maintainer: | Dyfan Jones <[email protected]> |
License: | Apache License (>= 2.0) |
Version: | 0.1.0 |
Built: | 2025-01-20 02:39:15 UTC |
Source: | https://github.com/DyfanJones/sagemaker-r-debugger |
'R6sagemaker' debugger for sagemaker operations.
Maintainer: Dyfan Jones [email protected]
Other contributors:
Amazon.com, Inc. [copyright holder]
Base class for action, which is to be invoked when a rule fires. Offers 'serialize' function to convert action parameters to a string dictionary.
new()
This class is not meant to be initialized directly. Accepts dictionary of action parameters and drops keys whose values are 'None'.
Action$new(...)
...
: Dictionary of action parameters.
serialize()
Serialize the action parameters as a string dictionary.
Action$serialize()
Action parameters serialized as a string dictionary.
format()
format class
Action$format()
clone()
The objects of this class are cloneable with this method.
Action$clone(deep = FALSE)
deep
Whether to make a deep clone.
Higher level object to maintain a list of actions to be invoked when a rule is fired.
new()
Offers higher level 'serialize' function to handle serialization of actions as a string list of dictionaries.
ActionList$new(...)
...
: List of actions.
update_training_job_prefix_if_not_specified()
For any StopTraining actions in the action list, update the trainingjob prefix to be the training job name if the user has not already specified a custom training job prefix. This is meant to be called via the sagemaker SDK when 'estimator.fit' is called by the user. Validation is purposely excluded here so that any failures in validation of the training job name are intentionally caught in the sagemaker SDK and not here.
ActionList$update_training_job_prefix_if_not_specified(training_job_name)
training_job_name
: Name of the training job, passed in when 'estimator.fit' is called.
serialize()
Serialize the action parameters as a string dictionary.
ActionList$serialize()
Action parameters serialized as a string dictionary.
format()
format class
ActionList$format()
clone()
The objects of this class are cloneable with this method.
ActionList$clone(deep = FALSE)
deep
Whether to make a deep clone.
This rule helps to detect if GPU is underulitized because of the batch size being too small. To detect this the rule analyzes the average GPU memory footprint, CPU and GPU utilization. If utilization on CPU, GPU and memory footprint is on average low , it may indicate that user can either run on a smaller instance type or that batch size could be increased. This analysis does not work for frameworks that heavily over-allocate memory. Increasing batch size could potentially lead to a processing/dataloading bottleneck, because more data needs to be pre-processed in each iteration.
sagemaker.debugger::ProfilerRuleBase
-> BatchSize
new()
Initialize BatchSize class
BatchSize$new( cpu_threshold_p95 = 70, gpu_threshold_p95 = 70, gpu_memory_threshold_p95 = 70, patience = 1000, window = 500, scan_interval_us = 60 * 1000 * 1000 )
cpu_threshold_p95
(numeric): defines the threshold for 95th quantile of CPU utilization.Default is 70%.
gpu_threshold_p95
(numeric): defines the threshold for 95th quantile of GPU utilization.Default is 70%.
gpu_memory_threshold_p95
(numeric): defines the threshold for 95th quantile of GPU memory utilization.Default is 70%.
patience
(numeric): defines how many data points to capture before Rule runs the first evluation. Default 100
window
(numeric): window size for computing quantiles.
scan_interval_us
(numeric): interval with which timeline files are scanned. Default is 60000000 us.
clone()
The objects of this class are cloneable with this method.
BatchSize$clone(deep = FALSE)
deep
Whether to make a deep clone.
This rule helps to detect if GPU is underutilized due to CPU bottlenecks. Rule returns True if number of CPU bottlenecks exceeds a predefined threshold.
sagemaker.debugger::ProfilerRuleBase
-> CPUBottleneck
new()
Initialize CPUBottleneck class
CPUBottleneck$new( threshold = 50, gpu_threshold = 10, cpu_threshold = 90, patience = 1000, scan_interval_us = 60 * 1000 * 1000 )
threshold
: defines the threshold beyond which Rule should return True. Default is 50 percent. So if there is a bottleneck more than 50% of the time during the training Rule will return True.
gpu_threshold
: threshold that defines when GPU is considered being under-utilized. Default is 10%
cpu_threshold
: threshold that defines high CPU utilization. Default is above 90%
patience
: How many values to record before checking for CPU bottlenecks. During training initialization, GPU is likely at 0 percent, so Rule should not check for under utilization immediately. Default 1000.
scan_interval_us
: interval with which timeline files are scanned. Default is 60000000 us.
clone()
The objects of this class are cloneable with this method.
CPUBottleneck$clone(deep = FALSE)
deep
Whether to make a deep clone.
This rule helps to detect how many dataloader processes are running in parallel and whether the total number is equal the number of available CPU cores.
sagemaker.debugger::ProfilerRuleBase
-> Dataloader
new()
Initialize Dataloader class
Dataloader$new( min_threshold = 70, max_threshold = 200, scan_interval_us = 6e+07 )
min_threshold
: how many cores should be at least used by dataloading processes. Default 70%
max_threshold
: how many cores should be at maximum used by dataloading processes. Default 200%
scan_interval_us
: interval with which timeline files are scanned. Default is 60000000 us.
clone()
The objects of this class are cloneable with this method.
Dataloader$clone(deep = FALSE)
deep
Whether to make a deep clone.
Action for sending an email to the provided email address when the rule is fired. Note that a policy must be created in the AWS account to allow the sagemaker role to send an email to the user.
sagemaker.debugger::Action
-> Email
new()
Initialize Email action class.
Email$new(email_address)
email_address
: Email address to send the email notification to.
clone()
The objects of this class are cloneable with this method.
Email$clone(deep = FALSE)
deep
Whether to make a deep clone.
This rule helps to detect large increase in memory usage on GPUs. The rule computes the moving average of continous datapoints and compares it against the moving average of previous iteration.
sagemaker.debugger::ProfilerRuleBase
-> GPUMemoryIncrease
new()
Initialize GPUMemoryIncrease class
GPUMemoryIncrease$new( increase = 5, patience = 1000, window = 10, scan_interval_us = 60 * 1000 * 1000 )
increase
: defines the threshold for absolute memory increase.Default is 5%. So if moving average increase from 5% to 6%, the rule will fire.
patience
: defines how many continous datapoints to capture before Rule runs the first evluation. Default is 1000
window
: window size for computing moving average of continous datapoints
scan_interval_us
: interval with which timeline files are scanned. Default is 60000000 us.
clone()
The objects of this class are cloneable with this method.
GPUMemoryIncrease$clone(deep = FALSE)
deep
Whether to make a deep clone.
This rule helps to detect if GPU is underutilized due to IO bottlenecks. Rule returns True if number of IO bottlenecks exceeds a predefined threshold.
sagemaker.debugger::ProfilerRuleBase
-> IOBottleneck
new()
Initialize IOBottleneck class
IOBottleneck$new( threshold = 50, gpu_threshold = 10, io_threshold = 50, patience = 1000, scan_interval_us = 60 * 1000 * 1000 )
threshold
: defines the threshold when Rule should return True. Default is 50 percent. So if there is a bottleneck more than 50% of the time during the training Rule will return True.
gpu_threshold
: threshold that defines when GPU is considered being under-utilized. Default is 70%
io_threshold
: threshold that defines high IO wait time. Default is above 50%
patience
: How many values to record before checking for IO bottlenecks. During training initilization, GPU is likely at 0 percent, so Rule should not check for underutilization immediatly. Default 1000.
scan_interval_us
: interval with which timeline files are scanned. Default is 60000000 us.
clone()
The objects of this class are cloneable with this method.
IOBottleneck$clone(deep = FALSE)
deep
Whether to make a deep clone.
This rule helps to detect issues in workload balancing between multiple GPUs. It computes a histogram of utilization per GPU and measures the distance between those histograms. If the histogram exceeds a pre-defined threshold then rule triggers.
sagemaker.debugger::ProfilerRuleBase
-> LoadBalancing
new()
Initialize LoadBalancing class
LoadBalancing$new( threshold = 0.5, patience = 1000, scan_interval_us = 60 * 1000 * 1000 )
threshold
: difference between 2 histograms 0.5
patience
: how many values to record before checking for loadbalancing issues
scan_interval_us
: interval with which timeline files are scanned. Default is 60000000 us.
clone()
The objects of this class are cloneable with this method.
LoadBalancing$clone(deep = FALSE)
deep
Whether to make a deep clone.
This rule helps to detect if GPU utilization is low or suffers from fluctuations. This is checked for each single GPU on each worker node. Rule returns True if 95th quantile is below threshold_p95 which indicates under-utilization. Rule returns true if 95th quantile is above threshold_p95 and 5th quantile is below threshold_p5 which indicates fluctuations.
sagemaker.debugger::ProfilerRuleBase
-> LowGPUUtilization
new()
Initialize LowGPUUtilization class
LowGPUUtilization$new( threshold_p95 = 70, threshold_p5 = 10, window = 500, patience = 1000, scan_interval_us = 60 * 1000 * 1000 )
threshold_p95
: threshold for 95th quantile below which GPU is considered to be underutilized. Default is 70 percent.
threshold_p5
: threshold for 5th quantile. Default is 10 percent.
window
: number of past datapoints which are used to compute the quantiles.
patience
: How many values to record before checking for underutilization/fluctuations. During training initilization, GPU is likely at 0 percent, so Rule should not check for underutilization immediately. Default 1000.
scan_interval_us
: interval with which timeline files are scanned. Default is 60000000 us.
clone()
The objects of this class are cloneable with this method.
LowGPUUtilization$clone(deep = FALSE)
deep
Whether to make a deep clone.
This rule helps to detect if the training intialization is taking too much time. The rule waits until first step is available.
sagemaker.debugger::ProfilerRuleBase
-> MaxInitializationTime
new()
Initialize MaxInitializationTime class
MaxInitializationTime$new(threshold = 20, scan_interval_us = 60 * 1000 * 1000)
threshold
: defines the threshold in minutes to wait for first step to become available. Default is 20 minutes.
scan_interval_us
: interval with which timeline files are scanned. Default is 60000000 us.
clone()
The objects of this class are cloneable with this method.
MaxInitializationTime$clone(deep = FALSE)
deep
Whether to make a deep clone.
This rule measures overall system usage per worker node. The rule currently only aggregates values per node and computes their percentiles. The rule does currently not take any threshold parameters into account nor can it trigger. The reason behind that is that other rules already cover cases such as under utilization and they do it at a more fine-grained level e.g. per GPU. We may change this in the future.
sagemaker.debugger::ProfilerRuleBase
-> OverallSystemUsage
new()
Initialize OverallSystemUsage class
OverallSystemUsage$new(scan_interval_us = 60 * 1000 * 1000)
scan_interval_us
: interval with which timeline files are scanned. Default is 60000000 us.
clone()
The objects of this class are cloneable with this method.
OverallSystemUsage$clone(deep = FALSE)
deep
Whether to make a deep clone.
This rule will create a profiler report after invoking all of the rules. The parameters used in any of these rules can be customized by following this naming scheme: <rule_name>_<parameter_name> : value Validation is also done here to ensure that:
The key names follow the above format
rule_name corresponds to a valid rule name.
parameter_name corresponds to a valid parameter of this rule.
The parameter for this rule's parameter is valid.
sagemaker.debugger::ProfilerRuleBase
-> ProfilerReport
new()
Initialize ProfilerReport class
ProfilerReport$new(...)
...
: Dictionary mapping rule + parameter name to value.
clone()
The objects of this class are cloneable with this method.
ProfilerReport$clone(deep = FALSE)
deep
Whether to make a deep clone.
Use the Debugger built-in rules provided by Amazon SageMaker Debugger and analyze tensors emitted while training your models. The Debugger built-in rules monitor various common conditions that are critical for the success of a training job. You can call the built-in rules using Amazon SageMaker Python SDK or the low-level SageMaker API operations. Depending on deep learning frameworks of your choice, there are four scopes of validity for the built-in rules as shown in the following table. https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html
vanishing_gradient() similar_across_runs() weight_update_ratio() all_zero() exploding_tensor() unchanged_tensor() loss_not_decreasing() check_input_images() dead_relu() confusion() tree_depth() class_imbalance() overfit() tensor_variance() overtraining() poor_weight_initialization() saturated_activation() nlp_sequence_ratio() stalled_training_rule() feature_importance_overweight() create_xgboost_report()
vanishing_gradient() similar_across_runs() weight_update_ratio() all_zero() exploding_tensor() unchanged_tensor() loss_not_decreasing() check_input_images() dead_relu() confusion() tree_depth() class_imbalance() overfit() tensor_variance() overtraining() poor_weight_initialization() saturated_activation() nlp_sequence_ratio() stalled_training_rule() feature_importance_overweight() create_xgboost_report()
list to be used in Amazon SageMaker Debugger
Action for sending an SMS to the provided phone number when the rule is fired. Note that a policy must be created in the AWS account to allow the sagemaker role to send an SMS to the user.
sagemaker.debugger::Action
-> SMS
new()
Initialize SMS action class
SMS$new(phone_number)
phone_number
: Valid phone number that follows the the E.164 format. See https://docs.aws.amazon.com/sns/latest/dg/sms_publish-to-phone.html for more info.
clone()
The objects of this class are cloneable with this method.
SMS$clone(deep = FALSE)
deep
Whether to make a deep clone.
This rule helps to detect outlier in step durations. Rule returns True if duration is larger than stddev * standard deviation.
sagemaker.debugger::ProfilerRuleBase
-> StepOutlier
new()
Initialize StepOutlier class
StepOutlier$new( stddev = 3, mode = NULL, n_outliers = 10, scan_interval_us = 60 * 1000 * 1000 )
stddev
: factor by which to multiply the standard deviation. Default is 3
mode
: select mode under which steps have been saved and on which Rule should run on. Per default rule will run on steps from EVAL and TRAIN phase.
n_outliers
: How many outliers to ignore before rule returns True. Default 10.
scan_interval_us
: interval with which timeline files are scanned. Default is 60000000 us.
clone()
The objects of this class are cloneable with this method.
StepOutlier$clone(deep = FALSE)
deep
Whether to make a deep clone.
Action for stopping the training job when a rule is fired.
sagemaker.debugger::Action
-> StopTraining
new()
Note that a policy must be created in the AWS account to allow the sagemaker role to stop the training job.
StopTraining$new(training_job_prefix = NULL)
training_job_prefix
: The prefix of the training job to stop if the rule is fired. This must only refer to one active training job, otherwise no training job will be stopped.
update_training_job_prefix_if_not_specified()
Update the training job prefix to be the training job name if the user has not already specified a custom training job prefix. This is only meant to be called via the sagemaker SDK when 'estimator.fit' is called by the user. Validation is purposely excluded here so that any failures in validation of the training job name are intentionally caught in the sagemaker SDK and not here.
StopTraining$update_training_job_prefix_if_not_specified(training_job_name)
training_job_name
: Name of the training job, passed in when 'estimator.fit' is called.
clone()
The objects of this class are cloneable with this method.
StopTraining$clone(deep = FALSE)
deep
Whether to make a deep clone.