
I’m trying to use TensorFlow’s CTC implementation from the contrib package (tf.contrib.ctc.ctc_loss), without success.

  • First of all, does anyone know where I can find a good step-by-step tutorial? TensorFlow’s documentation is very poor on this topic.
  • Do I have to provide the labels to ctc_loss with the blank label interleaved, or not?
  • I was not able to overfit my network, even using a training dataset of length 1 over 200 epochs. 🙁
  • How can I calculate the label error rate using tf.edit_distance?

Here is my code:

with graph.as_default():

  max_length = X_train.shape[1]
  frame_size = X_train.shape[2]
  max_target_length = y_train.shape[1]

  # Batch size x time steps x data width
  data = tf.placeholder(tf.float32, [None, max_length, frame_size])
  data_length = tf.placeholder(tf.int32, [None])

  #  Batch size x max_target_length
  target_dense = tf.placeholder(tf.int32, [None, max_target_length])
  target_length = tf.placeholder(tf.int32, [None])

  #  Generating sparse tensor representation of target
  target = ctc_label_dense_to_sparse(target_dense, target_length)

  # Applying LSTM, returning output for each timestep (y_rnn1, 
  # [batch_size, max_time, cell.output_size]) and the final state of shape
  # [batch_size, cell.state_size]
  y_rnn1, h_rnn1 = tf.nn.dynamic_rnn(
    tf.nn.rnn_cell.LSTMCell(num_hidden, state_is_tuple=True, num_proj=num_classes),
    data,
    dtype=tf.float32,
    sequence_length=data_length,
  )

  #  For sequence labelling we want a prediction at every timestep, with the
  #  output weights shared across timesteps. Since the LSTMCell above uses
  #  num_proj=num_classes, each timestep's output is already a linear
  #  projection to num_classes, so no separate flatten/matmul/reshape is
  #  needed here.

  # Transposing to the time-major layout [max_time, batch_size, num_classes]
  # expected by ctc_loss
  logits = tf.transpose(y_rnn1, perm=(1, 0, 2))

  #  Get the loss by calculating ctc_loss. This op also calculates the
  #  gradient, and it performs the softmax internally, so its inputs should
  #  be unactivated, e.g. linear projections of LSTM outputs.
  loss = tf.reduce_mean(tf.contrib.ctc.ctc_loss(logits, target, data_length))

  #  Define our optimizer with learning rate
  optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)

  #  Decoding using beam search
  decoded, log_probabilities = tf.contrib.ctc.ctc_beam_search_decoder(logits, data_length, beam_width=10, top_paths=1)

Thanks!

Update (06/29/2016)

Thank you, @jihyeon-seo! So, at the input of the RNN we have something like [num_batch, max_time_step, num_features]. We use dynamic_rnn to perform the recurrent calculations given the input, outputting a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need to do an affine projection at each timestep with weight sharing, so we have to reshape to [num_batch*max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], add a bias, undo the reshape, and transpose (so we will have [max_time_step, num_batch, num_classes] for the ctc_loss input), and this result will be the input of the ctc_loss function. Did I do everything correctly?

This is the code:

    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)

    h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)

    #  Reshaping to share weights across timesteps
    x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])

    self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1

    #  Reshaping (note: this reshape turned out to be the bug fixed in the
    #  07/11/2016 update below)
    self._logits = tf.reshape(self._logits, [max_length, -1, num_classes])

    #  Calculating loss
    loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)

    self.cost = tf.reduce_mean(loss)

Update (07/11/2016)

Thank you @Xiv. Here is the code after the bug fix:

    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)

    h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)

    #  Reshaping to share weights across timesteps
    x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])

    self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1

    #  Reshaping back to [batch_size, max_length, num_classes], then
    #  transposing to the time-major layout expected by ctc_loss
    self._logits = tf.reshape(self._logits, [-1, max_length, num_classes])
    self._logits = tf.transpose(self._logits, (1, 0, 2))

    #  Calculating loss
    loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)

    self.cost = tf.reduce_mean(loss)

Update (07/25/16)

I published part of my code, working with one utterance, on GitHub. Feel free to use it! 🙂

2 Answers


  1. I’m trying to do the same thing. Here’s what I found; you may be interested in it.

    It was really hard to find a tutorial for CTC, but this example was helpful.

    As for the blank label: the CTC layer assumes that the blank index is num_classes - 1, so you need to provide an additional class for the blank label.
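
    For example, a minimal illustration of that convention (the numbers here are made up):

        num_labels = 26                # real symbols, indexed 0..25
        num_classes = num_labels + 1   # one extra class for the CTC blank
        blank_index = num_classes - 1  # the blank is the last index, 26
        # Targets should contain only indices 0..num_labels-1; you do not
        # interleave blanks yourself -- ctc_loss handles blanks internally.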

    Also, the CTC layer performs the softmax itself. In your code, the RNN layer is connected directly to the CTC loss layer. Since the output of the RNN layer is internally activated, you need to add one more hidden layer (it could be the output layer) without an activation function, and then add the CTC loss layer.
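
    A rough sketch of what I mean, reusing the names from your graph and assuming cell = tf.nn.rnn_cell.LSTMCell(num_hidden, state_is_tuple=True), i.e. without num_proj (W_out and b_out are hypothetical variables of shapes [num_hidden, num_classes] and [num_classes]):

        outputs, _ = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32,
                                       sequence_length=data_length)
        # Merge batch and time so one weight matrix serves every timestep
        outputs_flat = tf.reshape(outputs, [-1, num_hidden])
        # Linear projection only: no activation, ctc_loss applies the softmax
        logits_flat = tf.matmul(outputs_flat, W_out) + b_out
        logits = tf.reshape(logits_flat, [-1, max_length, num_classes])
        logits = tf.transpose(logits, (1, 0, 2))  # time-major for ctc_loss
        loss = tf.reduce_mean(tf.contrib.ctc.ctc_loss(logits, target, data_length))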

  2. See here for an example with bidirectional LSTM, CTC, and edit distance implementations, training a phoneme recognition model on the TIMIT corpus. If you train on that corpus’s training set, you should be able to get phoneme error rates down to 20-25% after 120 epochs or so.
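
    To tie that back to the label error rate part of the question, here is a hedged sketch using the decoded and target tensors from the question’s graph (tf.edit_distance compares two SparseTensors and, with normalize=True, divides by the truth length):

        # decoded[0] is an int64 SparseTensor from the beam search decoder;
        # cast it to match the int32 sparse targets before comparing
        ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32),
                                              target, normalize=True))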
