
Let’s say I have some text:

Lorem ipsum dolor sit amet, consectetur adipiscing elit,\n
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris\n
nisi ut aliquip ex ea commodo consequat.\n
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore\n
eu fugiat nulla pariatur.\n
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia\n
deserunt mollit anim id est laborum.\n

What is the most efficient way to cut it into chunks of x bytes, where the cut can only happen at the carriage return?

Two methods come to mind:

  • split the text into lines, add lines to a buffer until the buffer is full, roll back the last line that caused the overflow, and repeat.

  • find the offset in the text at the buffer length and walk back to the previous carriage return, with proper handling of the beginning and ending of the text
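
The second method can be sketched roughly as follows. This is only an illustration under a few assumptions: `chunkAtNewlines` is a hypothetical name, lengths are counted in characters rather than bytes, each chunk keeps its trailing newline, and a line longer than the limit is cut hard at the limit.

```fsharp
// Hypothetical helper: cut `text` into chunks of at most `maxLen` characters
// (maxLen >= 1), cutting only right after a '\n' where possible.
let chunkAtNewlines (maxLen : int) (text : string) : string list =
    let rec loop (start : int) (acc : string list) =
        if start >= text.Length then
            List.rev acc
        elif text.Length - start <= maxLen then
            // The remainder fits into a single chunk.
            List.rev (text.Substring(start) :: acc)
        else
            // Search backwards over the window [start, start + maxLen - 1]
            // for the last newline the chunk may still contain.
            let cut = text.LastIndexOf('\n', start + maxLen - 1, maxLen)
            // Assumption: with no newline in the window, cut hard at maxLen.
            let stop = if cut < 0 then start + maxLen else cut + 1
            loop stop (text.Substring(start, stop - start) :: acc)
    loop 0 []
```

Concatenating the returned chunks gives back the original text, since nothing is dropped at the cut points.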

I couldn’t find a solution online, but I can’t believe that this problem hasn’t already been solved many times, and there may be a common implementation of this.


Edit:

more information about my use case:

The code is for a Telegram bot which is used as a communication tool with an internal system.

Telegram allows up to 4kb per message and throttles the number of calls.

Right now I collect all messages, put them in a concurrent queue, and then a task flushes the queue every second.

Messages can be a single line, can be a collection of lines and can sometimes be larger than 4kb.

I take all the messages (some being multiple lines in one block), aggregate them into a single string, then split the string by carriage return and then I can compose blocks of up to 4kb.
One additional problem I haven’t tackled yet, but that’s for later, is that Telegram will reject incomplete markup, so I will also need to cut the text based on that at some point.
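
Since the limit is expressed in bytes, the aggregation step described above could measure UTF-8 size rather than string length. The sketch below is a hypothetical illustration, not the actual bot code: `packLines` is an invented name, the UTF-8 encoding is an assumption, and a line that is larger than the limit on its own is passed through as a separate oversized block.

```fsharp
open System.Text

// Hypothetical helper: greedily pack lines into blocks whose UTF-8 size,
// counting one '\n' between lines, stays within `maxBytes`.
let packLines (maxBytes : int) (lines : string seq) : string list =
    let byteCount (s : string) = Encoding.UTF8.GetByteCount s
    // Turn the (reversed) lines of the current block into one string.
    let flush (current : string list) (blocks : string list) =
        match current with
        | [] -> blocks
        | _ -> String.concat "\n" (List.rev current) :: blocks
    let folder (blocks, current, size) (line : string) =
        let n = byteCount line
        let needed = if List.isEmpty current then n else size + 1 + n
        if needed <= maxBytes || List.isEmpty current then
            blocks, line :: current, needed
        else
            // The line would overflow the block: close it and start a new one.
            flush current blocks, [ line ], n
    let blocks, current, _ = Seq.fold folder ([], [], 0) lines
    flush current blocks |> List.rev
```

With the aggregated string in hand, a call such as `aggregated.Split '\n' |> packLines 4096` would produce the blocks to send, assuming 4096 bytes is the right figure for the limit.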

2 Answers


  1. This is not very efficient, and it relies on two assumptions:

    • that you want to preserve the newline separators, and
    • that the end of the string can be treated as equivalent
      to a single newline.

    Under those assumptions, an implementation along the lines of your first approach is both functional and straightforward: split the text into lines and combine adjacent lines unless their combined length would exceed the threshold.

    // Comma-separated output of the string lengths
    // (plus 1 to compensate for the absence of the EOL)
    let printLengths =
        Array.map (String.length >> (+) 1 >> string)
        >> String.concat ", "
        >> printfn "%s"
    let text = 
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit,
    sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
    nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
    eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
    deserunt mollit anim id est laborum.
    
    "
    text.Split '\n' |> printLengths
    // prints 57, 67, 67, 41, 77, 26, 74, 37, 1, 1
    
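    // Walk the lines from last to first; prepend each line to the chunk
    // after it as long as the result (plus its final EOL) stays within n.
    let foo n (text : string) =
        (text.Split '\n', [])
        ||> Array.foldBack (fun t -> function
        // t is the earlier line, x the chunk that follows it in the text
        | x::xs when String.length x + t.Length + 1 < n -> t + "\n" + x :: xs
        | xs -> t::xs )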
    text |> foo 108 |> List.toArray |> printLengths
    // prints 57, 67, 108, 77, 100, 39
    
  2. Most common stream-related tasks are already implemented very efficiently in the BCL.
    It’s probably a good idea to stick with tried-and-tested Stream classes.

    open System.IO
    open System.Text

    let lipsum =
        """
        Lorem ipsum dolor sit amet, consectetur adipiscing elit,
        sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
        Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
        nisi ut aliquip ex ea commodo consequat.
        Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
        eu fugiat nulla pariatur.
        Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
        deserunt mollit anim id est laborum.
        """
    
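    use stream = new MemoryStream(Encoding.UTF8.GetBytes(lipsum))
    use reader = new StreamReader(stream)

    // Holds the line that did not fit into the previous block. StreamReader
    // buffers its input, so seeking the underlying stream cannot "un-read" a line.
    let pending : string ref = ref null

    let readBlock blockSize =
        let writer = StringBuilder(capacity = blockSize)
        let rec readNextLine () =
            let line = if isNull pending.Value then reader.ReadLine() else pending.Value
            pending.Value <- null
            if not (isNull line) then
                // A line longer than blockSize is still emitted, as a block of its own.
                if writer.Length > 0 && writer.Length + line.Length + 1 > blockSize then
                    pending.Value <- line
                else
                    writer.Append(line).Append('\n') |> ignore
                    readNextLine ()

        readNextLine ()
        writer.ToString()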
    
    readBlock 300 |> printfn "%s"
    

    You can flush the queue by writing to the same MemoryStream, then call readBlock repeatedly to get new blocks of at most the specified size.
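
    One way to picture the "keep getting new blocks" part is a lazy sequence over any TextReader. This is a standalone sketch, not part of the code above: `blocks` is a hypothetical name, sizes are counted in characters, every line is terminated with '\n', and a line longer than the block size becomes a block of its own.

```fsharp
open System.IO
open System.Text

// Hypothetical helper: lazily yields blocks of whole lines read from `reader`,
// each at most `blockSize` characters unless a single line is longer than that.
let blocks (blockSize : int) (reader : TextReader) : seq<string> =
    seq {
        let sb = StringBuilder()
        let line = ref (reader.ReadLine())
        while not (isNull line.Value) do
            // Close the current block if the next line would overflow it.
            if sb.Length > 0 && sb.Length + line.Value.Length + 1 > blockSize then
                yield sb.ToString()
                sb.Clear() |> ignore
            sb.Append(line.Value).Append('\n') |> ignore
            line.Value <- reader.ReadLine()
        // Emit whatever is left once the reader is exhausted.
        if sb.Length > 0 then yield sb.ToString()
    }
```

    Iterating the sequence, for example with `Seq.iter`, would then hand out one block per message to send.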
