Reinforcement Learning
    The RL problem presented in MLDemos is a Food Gathering problem, in
    which the goal is to provide a policy for navigating a continuous
    two-dimensional space and picking up food. The states, actions and
    rewards are defined next.
    
    States 
    States are defined as two-dimensional positions (x, y) ∈ ℝ² in the
    canvas space. The space is continuous and, for practical purposes,
    bounded to [0, 1] in each dimension.
    
    Actions
    Actions are defined as movements from one state to another, following
    a set of possible directions (defined by the user). The sets of
    possible actions from each state are:

      - 4-way: movement along either the horizontal or the vertical axis
      - 8-way: movement along the horizontal or vertical axes, or
        diagonally at a 45° angle
      - Full: movement along an arbitrary direction θ (θ ∈ [0, 2π])

    In all cases, an additional "wait" action allows the agent to stay in
    place.
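    As an illustration of the three action sets above, the following
    Python sketch maps an action to a 2-D displacement; the function
    name, the fixed step length STEP and the use of None for the extra
    "wait" action are assumptions made for illustration, not the MLDemos
    implementation.

        import math

        STEP = 0.05  # assumed fixed step length per move

        def action_to_displacement(action, mode="8-way"):
            if action is None:                   # "wait": do not move
                return (0.0, 0.0)
            if mode == "4-way":                  # axis-aligned moves only
                moves = [(STEP, 0), (-STEP, 0), (0, STEP), (0, -STEP)]
                return moves[action]
            if mode == "8-way":                  # axes plus 45° diagonals
                theta = action * math.pi / 4
            else:                                # "full": action is theta in [0, 2*pi]
                theta = action
            return (STEP * math.cos(theta), STEP * math.sin(theta))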
    
    Rewards
    The state-value function is computed cumulatively, by considering how
    much food is collected along a trajectory from a given initial state,
    over a number of Evaluation Steps (defined by the user).
    
      - Sum of Rewards: sum of all the food present in the current state
        at each step of the trajectory. Food can be collected multiple
        times from the same state.
      - Sum (Non Repeatable): same as Sum of Rewards, but once food is
        collected at a given step, the food at that location is erased
        and cannot be collected again.
      - Sum - Harsh Turns: same as above, but a penalty is applied
        whenever the latest action deviates by more than 90° from the
        previous one (a harsh turn), in which case no food is awarded for
        that step.
      - Average of Rewards: the amount of food collected is divided by
        the number of steps taken over the whole trajectory.
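    The following Python sketch illustrates how a single trajectory could
    be scored under the four evaluation modes; the function names, the
    food(x, y) lookup and the rounding used to mark a position as already
    visited are assumptions for illustration, not the actual MLDemos
    code.

        import math

        def evaluate_trajectory(states, food, mode="sum"):
            """Score a trajectory (list of (x, y) states); food(x, y)
            returns the reward painted at a position."""
            collected = 0.0
            visited = set()
            prev_dir = None
            for i, (x, y) in enumerate(states):
                r = food(x, y)
                if mode == "non-repeatable":
                    key = (round(x, 3), round(y, 3))  # assumed granularity
                    if key in visited:
                        r = 0.0
                    visited.add(key)
                elif mode == "harsh-turns" and i >= 1:
                    dx, dy = x - states[i - 1][0], y - states[i - 1][1]
                    cur_dir = math.atan2(dy, dx)
                    if prev_dir is not None:
                        turn = abs((cur_dir - prev_dir + math.pi)
                                   % (2 * math.pi) - math.pi)
                        if turn > math.pi / 2:        # harsh turn: no food
                            r = 0.0
                    prev_dir = cur_dir
                collected += r
            if mode == "average":
                return collected / max(len(states), 1)
            return collected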
    The state-value function is evaluated at each policy-optimization
    iteration for as many states as there are basis functions, each
    evaluation starting from the state at the center of the corresponding
    basis function (on the grid).
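    For instance, a minimal sketch of laying out one evaluation start
    state at the center of each basis function on a regular grid over the
    unit canvas (the grid resolution is an assumed parameter, not the
    MLDemos setting):

        def basis_centers(grid_size=8):
            step = 1.0 / grid_size
            return [((i + 0.5) * step, (j + 0.5) * step)
                    for i in range(grid_size) for j in range(grid_size)]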
    
    Policies
    Three policies have been implemented in MLDemos. In all cases, the
    policy determines which action is taken from each state using a
    grid-like distribution of basis functions. The action taken from a
    specific state is "influenced" by the policy using three different
    paradigms:
    
      - Nearest Neighbors: the action taken from each state depends
        entirely on the direction suggested by the nearest basis
        function.
      - Linear combination: the action taken from each state is computed
        as a linear combination of the closest basis functions, each
        weighted as a function of its proximity (inverse Euclidean
        distance).
      - Gaussians: the action taken from each state is computed as a
        linear combination of the closest basis functions, each weighted
        as a function of its proximity (Gaussian function, with sigma
        equal to the distance between basis functions).
    The first case is peculiar in that, while the state space is
    continuous, the policy provides the exact same action for whole
    regions of states, which makes the problem somewhat discretized. The
    other two policies provide a continuous set of actions for a
    continuous state space and therefore pose no problems of a somewhat
    ontological nature.
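    The three weighting paradigms can be sketched as follows; storing one
    preferred direction per basis function and combining the suggested
    directions as weighted unit vectors is an assumption made for
    illustration, not the exact MLDemos implementation.

        import math

        def policy_direction(state, centers, thetas, mode="gaussian",
                             sigma=0.125):
            """Return the direction to follow from `state`, given the
            basis-function centers and their preferred directions."""
            dists = [math.dist(state, c) for c in centers]
            if mode == "nearest":            # nearest basis function wins
                return thetas[dists.index(min(dists))]
            if mode == "linear":             # inverse Euclidean distance
                weights = [1.0 / (d + 1e-9) for d in dists]
            else:                            # Gaussian, sigma ~ grid spacing
                weights = [math.exp(-d * d / (2 * sigma * sigma))
                           for d in dists]
            # combine the suggested directions as weighted unit vectors
            wx = sum(w * math.cos(t) for w, t in zip(weights, thetas))
            wy = sum(w * math.sin(t) for w, t in zip(weights, thetas))
            return math.atan2(wy, wx)

    With sigma roughly equal to the spacing between basis functions, the
    Gaussian variant effectively blends only the few nearby centers,
    whereas the inverse-distance variant gives every center some
    influence.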
    
    In Practice
    The easiest way to test the reinforcement learning process is to:
    
      - Use the Reward Painter button in the drawing tools to paint
        food (red) onto the canvas
      - Click the Initialize button to start the learning process

    This will start the RL process, display the policy basis functions
    and update them every Display Steps iterations.
    
    Options and Commands
    The interface for Reinforcement Learning (the right-hand side of the
    Algorithm Options dialog) provides the following commands:
    
      - Initialize: initialize the RL problem and start the learning
        process
      - Pause / Continue: pause the learning process (this also stops the
        animation)
      - Clear: clear the current model (does NOT clear the data)
      - Drag Me: (for display purposes only) display the evaluation
        steps for an agent at a specific position (drag and drop onto
        the canvas)
      - X: erase all displayed agents

    The options regarding the policy type, reward and evaluation have
    been described above.
    
    Generate Rewards
    It is possible to generate a set of pre-constructed rewards by
    dragging and dropping either a Gaussian of fixed size (Var option) or
    a gradient running from the center of the canvas to the dropped
    position. Alternatively, a number of standard benchmark functions are
    provided; use the Set button to draw the selected benchmark function
    onto the canvas.
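    As a rough sketch of how such a Gaussian reward could be rendered
    onto a discretized reward map (the map resolution and the lack of
    normalization are assumptions for illustration only):

        import math

        def gaussian_reward_map(center, var=0.01, resolution=64):
            cx, cy = center
            reward = [[0.0] * resolution for _ in range(resolution)]
            for i in range(resolution):
                for j in range(resolution):
                    x = (i + 0.5) / resolution
                    y = (j + 0.5) / resolution
                    d2 = (x - cx) ** 2 + (y - cy) ** 2
                    reward[i][j] = math.exp(-d2 / (2 * var))
            return reward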