boax.policies.upper_confidence_bound

boax.policies.upper_confidence_bound(confidence)

The upper confidence bound policy function.

Selects the variant whose action value plus upper confidence bound is highest.
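The selection rule described above can be illustrated with a minimal, standard-library sketch of the classic UCB1 score, q + confidence * sqrt(ln(t) / n). This is not boax's implementation: the name ucb_sketch, the plain-list inputs (values, counts), the handling of untried variants, and the lowest-index tie-breaking are all assumptions made for illustration, and the exact bonus term boax uses may differ.

```python
import math
from typing import Callable, Sequence


def ucb_sketch(confidence: float) -> Callable[[Sequence[float], Sequence[int], int], int]:
    """Hypothetical helper: build a UCB1-style policy (not part of boax)."""

    def policy(values: Sequence[float], counts: Sequence[int], timestep: int) -> int:
        # Assumes timestep >= 1 so that log(timestep) is defined.
        scores = []
        for q, n in zip(values, counts):
            if n == 0:
                # Assumption: untried variants get an infinite score,
                # so each one is selected at least once.
                scores.append(math.inf)
            else:
                # Exploration bonus shrinks as a variant is tried more often.
                bonus = confidence * math.sqrt(math.log(timestep) / n)
                scores.append(q + bonus)
        # Ties resolve to the lowest index in this sketch.
        return max(range(len(scores)), key=scores.__getitem__)

    return policy
```

With confidence set to 0 the sketch reduces to greedy selection on the action values; larger values give rarely tried variants a bigger bonus.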

Example

>>> policy = upper_confidence_bound(confidence)
>>> variant = policy(params, timestep, key)

Parameters:

confidence (Union[Array, ndarray, bool, number, float, int]) – The confidence parameter controlling the trade-off between exploration and exploitation; larger values favor exploration.

Return type:

Policy[ActionValues]

Returns:

The corresponding Policy.