WEAKLY SUPERVISED MICROSCOPY CELL SEGMENTATION VIA CONVOLUTIONAL LSTM NETWORKS

Assaf Arbelle, Shaked Cohen and Tammy Riklin Raviv

1. METHODS

We address the segmentation of individual cells in microscopy sequences. The main challenge in this type of problem is not only foreground-background classification but also the separation of adjacent cells. We apply two orthogonal approaches to overcome the multiple-instance problem. From the segmentation perspective, we adopt the three-class loss used in [1], [2]. The segmentation representation is designed to enhance the delineation of individual cells by partitioning the image domain into three classes: foreground, background and cell contours. From the detection perspective, we take our inspiration from [3] and aim to detect rough cell markers. The markers, as opposed to the full segmentation, do not cover the entire cell, but rather form a small "blob" somewhere within it. The markers have two desirable properties. First, they are much smaller than the objects and are thus easier to separate into instances: one marker will never overlap or touch the boundary of a neighboring marker. Second, the markers are easy to annotate, since the annotator does not need to be precise, making data acquisition a simpler task. Often, for microscopy image sequences, the only available annotation is in the form of markers or approximate cell centers. We train the proposed network to estimate both the segmentation and the markers, and merge the two using the Fast Marching Distance (FMD) [4]. The entire framework is illustrated in Figure 1.

1.1. Input and Output

The input to the method is a sequence of live cell microscopy images of arbitrary length T. We define the d-dimensional (d = 2 or 3) image domain by Ω ⊂ R^d. We denote a frame in the input image sequence as I_t : Ω → R, where the subscript t ∈ [0, T-1] denotes the frame index and I_t(v) is the intensity of a pixel (or voxel) v ∈ Ω.
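The three-class representation above can be made concrete with a minimal NumPy sketch. The function name `three_class_map` and the contour definition (a foreground pixel whose 4-neighborhood contains a different label, i.e. background or an adjacent cell) are illustrative assumptions, not the paper's exact ground-truth construction.

```python
import numpy as np

def three_class_map(labels):
    """Partition an instance-labeled mask into background (0),
    foreground (1) and cell-contour (2) classes.
    Assumed contour rule: a foreground pixel is a contour pixel if any
    of its 4-neighbours carries a different label (background or an
    adjacent cell)."""
    h, w = labels.shape
    out = np.zeros((h, w), dtype=np.uint8)
    out[labels > 0] = 1  # foreground
    # Edge-replicating pad so image borders do not create false contours.
    padded = np.pad(labels, 1, mode="edge")
    for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        neigh = padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
        out[(labels > 0) & (neigh != labels)] = 2  # contour
    return out
```

For two touching cells, the pixels along their shared border are mapped to the contour class, which is exactly what lets the loss penalize merging adjacent instances.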
The output of the network consists of two components: the scalar marker map (Section 1.4) M_t : Ω → [0, 1], which represents the probability of a pixel (voxel) to belong to a marker (cell segmentation core), and the soft segmentation map (Section 1.3), denoted as S_t : Ω → [0, 1]^3, which represents the probabilities of each pixel (voxel) to belong to the foreground, the background or a cell boundary. These two maps are then passed to an instance segmentation block (Section 1.6), which outputs the final labeled segmentation map Γ_t : Ω → N^+. Figure 1 shows an overview of the proposed method with a visualization of the intermediate steps.

A. Arbelle, S. Cohen and T. Riklin Raviv are with the Department of Electrical and Computer Engineering, and the Zlotowski Center for Neuroscience, Ben-Gurion University of the Negev.

1.2. LSTM-UNet

The proposed network incorporates C-LSTM [5] blocks into the U-Net [6] architecture. This combination, first suggested in our preliminary work [1], is shown to be powerful. The U-Net architecture, built as an encoder-decoder with skip connections, enables the extraction of meaningful descriptors at multiple image scales. However, this alone does not account for the cell-specific dynamics that can significantly support the segmentation. The introduction of C-LSTM blocks into the network allows considering past cell appearances at multiple scales by holding their compact representations in the C-LSTM memory units. We propose here the incorporation of C-LSTM layers at every scale of the encoder section of the U-Net. Applying the C-LSTM at multiple scales is essential for cell microscopy sequences, since the frame-to-frame differences may occur at different scales, depending on the cells' dynamics. The specific architecture was selected based on preliminary work, which shows its empirical advantage over other alternatives [1]. The network is fully convolutional and can therefore be used with any image size during both training and testing.
Figure 1 illustrates the network architecture detailed in Section 2. The network is composed of two sections of N blocks each: the recurrent encoder block E^{(n)}_{θ_n}(·) and the decoder block D^{(n)}_{θ_n}(·), where θ_n are the network's parameters. The input to the C-LSTM encoder layer n ∈ {0, ..., N-1} at time t ∈ [0, T-1] includes the down-sampled output of the previous layer, the output of the current layer at the previous time step, and the C-LSTM memory cell. We denote these three inputs as x_t^{(n)}, h_{t-1}^{(n)}, c_{t-1}^{(n)}, respectively. Formally, we define:

(h_t^{(n)}, c_t^{(n)}) = E^{(n)}_{θ_n}(x_t^{(n)}, h_{t-1}^{(n)}, c_{t-1}^{(n)})    (1)

Footnote: to avoid artefacts, it is preferable to use image sizes that are multiples of eight, due to the three max-pooling layers.
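The encoder update in Eq. (1) follows the standard C-LSTM gating of [5]. A minimal NumPy sketch of one step is given below; for readability the kernels are 1×1, so the spatial convolution reduces to a per-pixel linear map over channels (the actual blocks use larger kernels), and the single stacked weight matrix `W` and function name `clstm_step` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def clstm_step(x, h_prev, c_prev, W, b):
    """One C-LSTM step: (h_t, c_t) = E(x_t, h_{t-1}, c_{t-1}).
    x: (Cx, H, W); h_prev, c_prev: (Ch, H, W);
    W: (4*Ch, Cx+Ch) stacked 1x1 kernels for the i, f, o, g gates;
    b: (4*Ch,) gate biases."""
    z = np.concatenate([x, h_prev], axis=0)               # (Cx+Ch, H, W)
    gates = np.tensordot(W, z, axes=1) + b[:, None, None]  # (4*Ch, H, W)
    i, f, o, g = np.split(gates, 4, axis=0)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)           # input/forget/output gates
    g = np.tanh(g)                                         # candidate update
    c = f * c_prev + i * g                                 # memory cell update
    h = o * np.tanh(c)                                     # hidden output
    return h, c
```

Unrolling `clstm_step` over t = 0, ..., T-1 at each encoder scale is what lets the network carry a compact summary of past cell appearances through the sequence.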