`uniq` in SerenityOS
I had known about Serenity for several years but only recently decided that it would be cool if I made a contribution.
I spent a few weeks thinking of what I could do, mostly resulting in overly ambitious project ideas.
Eventually I had enough of the brainstorming and started looking through the codebase to see if I could find anything interesting without having a preconceived "project idea."
I noticed that there are many `FIXME` comments throughout the code, so I decided that finding a simple one to fix would be an excellent starting point.
FIXME
Many of the `FIXME`s I found were beyond my skill level or otherwise infeasible to tackle without a deeper knowledge of the codebase, so I eventually settled on one in the implementation of the `uniq` utility, which read: "The buffer does not automatically resize, and this will return EMSGSIZE if the read line is more than 1024 bytes." This seemed very doable.
The `read_line` method in `BufferedStream` takes a buffer from the user and attempts to fill it up to the newline character. If the line is too long to fit into the buffer, two calls to `read_line` are required, so the line is essentially treated as two lines. This causes `uniq` to misbehave: if two identical adjacent lines are longer than the buffer, they cannot be compared directly, since the first line gets partially read in and is then compared with the remainder of that same line. This results in both identical lines being printed out.
My first attempts ended with memory getting deallocated out from under me. The code made use of `StringView`s of the buffer passed to `read_line`, assuming that the buffer would not change. When I resized the buffer, a reallocation occurred, which invalidated all `StringView`s of that buffer.
I wasn't sure of the best way to proceed, so I asked for help on the official Discord server. After some discussion, it was decided that we could add a `read_line_with_resize` method to `BufferedStream` that automatically resizes the user-supplied buffer when a line does not fit. Although adding a method to a core library might not always be wise, it made sense here: many utilities require this functionality, and it would be a waste of effort to duplicate the same detect-overflow/resize/re-read logic everywhere it is needed.
As a side note, some of the utilities get around this issue by using the libc function `getline`, which reallocates the supplied buffer if it is too small.
BufferedStream
Implementing `read_line_with_resize` required some planning for how to handle the different scenarios that could arise. The trickiest case is when the internal buffer of the `BufferedStream` cannot fit the line: the line is read into the user-supplied buffer in chunks, where each chunk is at most the size of the internal buffer, and the buffer is resized as needed. The heavy lifting happens in `read_until_any_of_with_resize` (`read_line_with_resize` calls this method eventually, which will handle any delimiter):
```cpp
template<size_t N>
ErrorOr<Bytes> read_until_any_of_with_resize(ByteBuffer& buffer, Array<StringView, N> candidates)
{
    if (!stream().is_open())
        return Error::from_errno(ENOTCONN);

    auto candidate = TRY(find_and_populate_until_any_of(candidates));

    size_t bytes_read_to_user_buffer = 0;
    while (!candidate.has_value()) {
        if (m_buffer.used_space() == 0 && stream().is_eof()) {
            // If we read to the very end of the buffered and unbuffered data,
            // then treat the remainder as a full line (the last one), even if it
            // doesn't end in the delimiter.
            return buffer.span().trim(bytes_read_to_user_buffer);
        }

        if (buffer.size() - bytes_read_to_user_buffer < m_buffer.used_space()) {
            // Resize the user supplied buffer because it cannot fit
            // the contents of m_buffer.
            TRY(buffer.try_resize(buffer.size() + m_buffer.used_space()));
        }

        // Read bytes into the buffer starting from the offset of how many
        // bytes have previously been read.
        bytes_read_to_user_buffer += m_buffer.read(buffer.span().slice(bytes_read_to_user_buffer)).size();
        candidate = TRY(find_and_populate_until_any_of(candidates));
    }

    // Once the candidate has been found, read the contents of m_buffer into the buffer,
    // offset by how many bytes have already been read in.
    TRY(buffer.try_resize(bytes_read_to_user_buffer + candidate->offset));
    m_buffer.read(buffer.span().slice(bytes_read_to_user_buffer));
    TRY(m_buffer.discard(candidate->size));
    return buffer.span();
}
```
One thing to note about this code is the use of `ErrorOr` and `TRY`, which are explained in the documentation.
A major issue I had at first was testing: since `BufferedStream` is included in many other files, modifying it triggered a fairly gnarly recompilation that took around 20-30 minutes on my laptop. Thankfully, the Serenity developers point out that certain subsystems can be run on the host without having to build the entire OS. After writing tests covering each of the scenarios outlined above, I was able to get a working implementation much faster. The full pull request can be found here.
uniq
With `read_line_with_resize` implemented, I could now address that `FIXME` in `uniq`. The specification for `uniq` describes its expected behavior as follows: "The uniq utility shall read an input file comparing adjacent lines, and write one copy of each input line on the output. The second and succeeding copies of repeated adjacent input lines shall not be written."
This was implemented by maintaining two buffers, one for the current line and one for the previous line: on each iteration the buffers are swapped and a new line is read into the current buffer.
The main change that needed to be made after replacing `read_line` with `read_line_with_resize` was to no longer swap the `StringView`s, as they could be invalidated after a resize, and instead declare a new `StringView` of the buffer as follows:
```cpp
swap(current_buf, previous_buf);
// The StringViews cannot be swapped since read_line_with_resize
// potentially changes the location of the buffers due to reallocation.
// Instead create a new StringView of what was most recently read in.
previous = StringView { previous_buf.span().trim(current.length()) };
```
I started writing tests for these changes and noticed some incongruencies between the output of this program and that of the coreutils version. It turned out that the last of a run of repeated lines was being output rather than the first.
Although this isn't apparent when comparing whole lines, when using the `-f` or `-s` flags to limit the comparison to some offset from the beginning of the line, it becomes clear that the wrong line is being output. Fixing this was a matter of adding an inner loop that skips over lines while they match the current line.
Lastly, I fixed a couple of off-by-one errors that were affecting the `-c` and `-f` flags. The full pull request can be found here[2].
1. Thanks to timschumi for his help on this.
2. Thanks to tcl3 for reviewing this PR.