glTF Render Client-Server API

Overview

Drake offers built-in renderers (RenderEngineVtk, RenderEngineGl), but in some cases users may want to use their own custom rendering implementations. One way to accomplish that is to subclass RenderEngine with a custom implementation and link that into Drake, but sometimes that leads to linker compatibility problems. Thus, we also offer another option: rendering using a remote procedure call (RPC). This document specifies the network API for servicing those requests.

The glTF render server API consists of two components: a client, and a server. The client generates and transmits glTF scene files to the server, which then renders and returns an image file back to the client. Drake implements the client side, which can be constructed via MakeRenderEngineGltfClient(). The server must respond with images of a certain type and dimension (width and height), but is otherwise free to produce any rendering desired in however much time it needs to render.
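
For example, the client can be constructed via the Python bindings as follows. This is a minimal sketch, assuming the pydrake bindings are available and a server is already listening at the given base_url (the URL shown is an assumption):

  # Construct the RPC client render engine and attach it to a SceneGraph.
  from pydrake.geometry import (
      MakeRenderEngineGltfClient,
      RenderEngineGltfClientParams,
      SceneGraph,
  )

  scene_graph = SceneGraph()
  params = RenderEngineGltfClientParams(
      base_url="http://127.0.0.1:8000",  # Where the render server listens.
      render_endpoint="render",          # See "Render Endpoint" below.
  )
  scene_graph.AddRenderer("gltf_client", MakeRenderEngineGltfClient(params))

Cameras configured to use the "gltf_client" renderer will then issue RPC requests to the server for every rendering.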

The primary design goal for this RPC architecture is batch-mode dataset generation (e.g., for machine learning), not real-time simulation. We transmit the scene in its entirety for each RPC request; the protocol does not support incremental changes to the scene.

The server is welcome to add in fixed scene elements beyond what's transmitted (e.g., the server could add in a building's walls and floor, rather than having the client transmit that same immutable geometry in every RPC). However, for anything dynamic the design intent is that every RPC call is comprehensive on its own.

Server API


A given server implementation is required to implement a "Render Endpoint" to which the client will issue a POST request with an HTML <form> and then await a rendered image response. The client's POST uploads the scene file along with a variety of metadata attributes describing the image being rendered (such as width and height), including the full specification of the systems::sensors::CameraInfo intrinsics. The scene file must use the glTF 2.0 file format and must only use embedded assets (with data: URIs). A future version of this protocol might allow for non-embedded assets.
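
As an illustration of the embedded-assets requirement, a server might validate an uploaded glTF as follows. This is a Python sketch; per the glTF 2.0 specification, external assets appear as non-data: URIs on "buffers" and "images" entries:

  # Sketch: verify an uploaded glTF uses only embedded (data: URI) assets.
  import json

  def assert_embedded_assets_only(gltf_path):
      with open(gltf_path) as f:
          gltf = json.load(f)
      # Buffers and images may carry a "uri"; any non-data: URI refers to an
      # external asset, which this protocol does not allow. (Images that use
      # a "bufferView" instead of a "uri" are already embedded.)
      for item in gltf.get("buffers", []) + gltf.get("images", []):
          uri = item.get("uri", "")
          if uri and not uri.startswith("data:"):
              raise ValueError(f"Non-embedded asset found: {uri}")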

The server is expected to block (delay sending a response) until it is ready to transmit the final rendered image back to the client. This simplifies the server implementation, and this single endpoint is the only form of communication the client will initiate with the server. If the server fails to respond with a valid image file, the client will produce an error.

The server is required to send a valid HTTP response code in both the event of success and failure. If an HTTP response code indicating success (2XX or 3XX) is received, the client will further require an image response of a supported image type with the correct dimensions. Otherwise, the client will attempt to populate an error message from the file response (if available). Note that with Flask in particular, if the server implementation doesn't set a response code explicitly, the response defaults to 200, which indicates success. When in doubt, have the server respond with 200 for a success, 400 for any failure related to a bad request (e.g., missing min_depth or max_depth for an image_type="depth"), and 500 for any unhandled errors.
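
For illustration, a minimal Flask sketch of this status-code policy follows. The render_scene() helper is hypothetical and stands in for an actual rendering backend; the field names match the <form> data table below:

  import hashlib
  from flask import Flask, request, send_file

  app = Flask(__name__)

  @app.route("/render", methods=["POST"])
  def render_endpoint():
      if "scene" not in request.files:
          return "Missing file field: scene", 400
      scene_bytes = request.files["scene"].read()
      if hashlib.sha256(scene_bytes).hexdigest() != request.form.get("scene_sha256"):
          return "scene_sha256 mismatch; the upload may be corrupt", 400
      if request.form.get("image_type") == "depth" and not (
              "min_depth" in request.form and "max_depth" in request.form):
          return "image_type='depth' requires min_depth and max_depth", 400
      try:
          # Block until rendering completes; this single response is the only
          # communication the client expects.
          png_path = render_scene(scene_bytes, request.form)  # Hypothetical.
      except Exception as e:
          return f"Render failure: {e}", 500
      return send_file(png_path, mimetype="image/png")  # HTTP 200 on success.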

Render Endpoint


The render endpoint (by default: /render) is responsible for receiving an uploaded scene file, rendering the scene, and transmitting an image back to the client. In addition to the scene file, the render endpoint is provided with the full specification of the systems::sensors::CameraInfo object, and optionally the depth range of the systems::sensors::DepthRange object.

Render Endpoint <form> Data


The client will POST a <form> with an enctype=multipart/form-data to the server with the following field entries:

- scene: The scene file contents. Sent as if from <input type="file" name="scene">. glTF scenes will have a mime type of model/gltf+json.
- scene_sha256: The sha256 hash of the scene file being uploaded. The server may use this entry to validate that the full file was successfully uploaded. Sent as form data <input type="text" name="scene_sha256">.
- image_type: The type of image being rendered. Its value will be one of: "color", "depth", or "label". Sent as form data <input type="text" name="image_type">.
- min_depth: The minimum depth range as specified by a depth sensor's DepthRange::min_depth(). Only provided when image_type="depth". Sent as form data <input type="number" name="min_depth">. Decimal value.
- max_depth: The maximum depth range as specified by a depth sensor's DepthRange::max_depth(). Only provided when image_type="depth". Sent as form data <input type="number" name="max_depth">. Decimal value.
- width: Width of the desired rendered image in pixels, as specified by systems::sensors::CameraInfo::width(); the server must respond with an image of the same width. Sent as form data <input type="number" name="width">. Integral value.
- height: Height of the desired rendered image in pixels, as specified by systems::sensors::CameraInfo::height(); the server must respond with an image of the same height. Sent as form data <input type="number" name="height">. Integral value.
- near: The near clipping plane of the camera, as specified by the RenderCameraCore's ClippingRange::near() value. Sent as form data <input type="number" name="near">. Decimal value.
- far: The far clipping plane of the camera, as specified by the RenderCameraCore's ClippingRange::far() value. Sent as form data <input type="number" name="far">. Decimal value.
- focal_x: The focal length x, in pixels, as specified by the systems::sensors::CameraInfo::focal_x() value. Sent as form data <input type="number" name="focal_x">. Decimal value.
- focal_y: The focal length y, in pixels, as specified by the systems::sensors::CameraInfo::focal_y() value. Sent as form data <input type="number" name="focal_y">. Decimal value.
- fov_x: The field of view in the x-direction (in radians), as specified by the systems::sensors::CameraInfo::fov_x() value. Sent as form data <input type="number" name="fov_x">. Decimal value.
- fov_y: The field of view in the y-direction (in radians), as specified by the systems::sensors::CameraInfo::fov_y() value. Sent as form data <input type="number" name="fov_y">. Decimal value.
- center_x: The principal point's x coordinate in pixels, as specified by the systems::sensors::CameraInfo::center_x() value. Sent as form data <input type="number" name="center_x">. Decimal value.
- center_y: The principal point's y coordinate in pixels, as specified by the systems::sensors::CameraInfo::center_y() value. Sent as form data <input type="number" name="center_y">. Decimal value.

Note: {focal_x, focal_y} duplicates the information in {fov_x, fov_y}, given the image size. Nonetheless, both are sent to the server so that the server can choose its preferred representation. The server can expect them to be consistent.
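
A server under development can be exercised without Drake by issuing the same POST by hand. Below is a Python sketch using the requests library; the URL and the camera values are assumptions (the focal lengths and fields of view shown were chosen to be mutually consistent):

  # Sketch: exercise a render server with a hand-built multipart POST.
  import hashlib
  import requests

  with open("scene.gltf", "rb") as f:
      scene_bytes = f.read()

  form_data = {
      "scene_sha256": hashlib.sha256(scene_bytes).hexdigest(),
      "image_type": "color",   # No min_depth/max_depth needed for "color".
      "width": "640",
      "height": "480",
      "near": "0.01",
      "far": "10.0",
      "focal_x": "579.41",
      "focal_y": "579.41",
      "fov_x": "1.0092",       # == 2 * atan(320 / 579.41)
      "fov_y": "0.7854",       # == 2 * atan(240 / 579.41)
      "center_x": "320.0",
      "center_y": "240.0",
  }
  response = requests.post(
      "http://127.0.0.1:8000/render",
      data=form_data,
      files={"scene": ("scene.gltf", scene_bytes, "model/gltf+json")},
  )
  response.raise_for_status()
  with open("rendered.png", "wb") as f:
      f.write(response.content)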

Allowed Image Response Types


The client accepts only a limited set of image types from a server render; for example, label renderings must be returned as an RGB or RGBA PNG (see Notes on Rendering Label Images below).

Notes on glTF Camera Specification


For a glTF scene file, note that there are two locations that describe the camera:

  1. The "cameras" array, which specifies the camera projection matrix. The client will always produce a length one "cameras" array, with a single entry ("camera 0"). This camera will always be of "type": "perspective", and its "aspectRatio", "yfov", "zfar", and "znear" attributes will accurately represent the drake sensor. However, note that the glTF perspective projection definition does not include all of the information present in the matrix that would be obtained by RenderCameraCore::CalcProjectionMatrix. While the two matrices will be similar, a given render server must decide based off its choice of render backend how it wishes to model the camera perspective projection transformation – utilize the glTF definition, or incorporate the remainder of the <form> data to construct its own projection matrix. A sample snippet from a client glTF file:
     {
       "cameras" :
       [
         {
           "perspective" :
           {
             "aspectRatio" : 1.3333333333333333,
             "yfov" : 0.78539816339744828,
             "zfar" : 10,
             "znear" : 0.01
           },
           "type" : "perspective"
         }
       ],
     }
    
  2. The "nodes" array, which specifies the camera's global transformation matrix. The "camera": 0 entry refers to the index into the "cameras" array from (1). Note that this is not the "model view transformation" (rather, its inverse) – it is the camera's global transformation matrix which places the camera in the world just like any other entry in "nodes". Note that the "matrix" is presented in column-major order, as prescribed by the glTF specification. A sample snippet of the camera node specification in the "nodes" array that has no rotation (the identity rotation) and a translation vector [x=0.1, y=0.2, z=0.3] would be provided as:
     {
       "nodes" :
       [
         {
           "camera" : 0,
           "matrix" :
           [
             1.0,
             0.0,
             0.0,
             0.0,
             0.0,
             1.0,
             0.0,
             0.0,
             0.0,
             0.0,
             1.0,
             0.0,
             0.1,
             0.2,
             0.3,
              1.0
           ],
           "name" : "Camera Node"
         },
       ],
     }
    
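
The following Python sketch illustrates both points above: recovering the camera's world pose from the column-major "matrix", and building a full projection matrix from the <form> intrinsics rather than the glTF "perspective" entry. The OpenGL-style sign conventions shown are an assumption (conventions vary across render backends), not part of the protocol:

  import json
  import numpy as np

  def camera_pose(gltf_path):
      """Returns the 4x4 world transform of the glTF camera node."""
      with open(gltf_path) as f:
          gltf = json.load(f)
      node = next(n for n in gltf["nodes"] if n.get("camera") == 0)
      # glTF stores "matrix" in column-major order, so read Fortran-style;
      # the translation then lands in the last column.
      return np.array(node["matrix"]).reshape(4, 4, order="F")

  def projection_matrix(form):
      """An OpenGL-style projection built from the <form> intrinsics."""
      w, h = float(form["width"]), float(form["height"])
      fx, fy = float(form["focal_x"]), float(form["focal_y"])
      cx, cy = float(form["center_x"]), float(form["center_y"])
      n, f = float(form["near"]), float(form["far"])
      return np.array([
          [2 * fx / w, 0.0, (w - 2 * cx) / w, 0.0],
          [0.0, 2 * fy / h, -(h - 2 * cy) / h, 0.0],
          [0.0, 0.0, -(f + n) / (f - n), -2 * f * n / (f - n)],
          [0.0, 0.0, -1.0, 0.0],
      ])

Unlike the glTF "perspective" entry, this construction preserves an off-center principal point (center_x, center_y), which is one reason a server may prefer the <form> data.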

Notes on Communicating Errors


When errors occur on the server side, the server should explicitly return an HTTP response code indicating a failed transaction.

Additionally, the server should clearly communicate why there was an upload or render failure as plain text in the file response. Though this is not strictly required, without it the user of the server will have no hints as to what is going wrong with the client-server communication. When a file response is provided, this information will be included in the exception message produced by the client.

Notes on Rendering Label Images


Renderers typically can't render objects with "labels". Drake encodes the labels associated with geometries as unique colors and provides those colors to the server as attributes on the meshes. Thus, the label output from any server will be an RGB or RGBA PNG.

All renderable artifacts that exist only in the server (i.e., that are not part of the Drake-provided glTF) must be colored white (RGB = (255, 255, 255)). These server-only renderable artifacts include, for example, the image background and any fixed scene elements the server adds on its own (see the Overview).

When producing the final label output, the client will interpret this particular RGB value as render::RenderLabel::kDontCare. This means that a remote server will never report a pixel with the render::RenderLabel::kEmpty value.

For an image to be a proper color-encoded label image, the only pixel values in the image must be one of the recognized label encodings. This may require special render configurations: any configuration that can introduce color variation must be disabled, including (but not limited to) render features such as anti-aliasing, lighting and shadows, and color transformations.
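
As a sanity check, a server author might verify that a rendered label image contains only recognized encodings. A Python sketch follows; the expected color set would be collected from the mesh colors in the client's glTF:

  # Sketch: every pixel in a label rendering must be a recognized label
  # encoding, or white for server-only artifacts (see above).
  import numpy as np
  from PIL import Image

  def validate_label_image(png_path, expected_colors):
      rgb = np.asarray(Image.open(png_path).convert("RGB"))
      found = {tuple(int(v) for v in c) for c in rgb.reshape(-1, 3)}
      unexpected = found - set(expected_colors) - {(255, 255, 255)}
      if unexpected:
          raise ValueError(f"Unrecognized label colors: {sorted(unexpected)}")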

Existing Server Implementations


drake-blender is a glTF render server using Blender as the backend.

Developing your own Server


To test the basic client-server communication and rendering, Drake provides a simple server implementation as a reference. For more information about developing your own server or running the prototype, refer to the README under //geometry/render_gltf_client/test.