`uniq` in SerenityOS
I had known about Serenity for several years but only recently decided that it would be cool if I made a contribution.
I spent a few weeks thinking of what I could do, mostly resulting in overly ambitious project ideas.
Eventually I had enough of the brainstorming and started looking through the codebase to see if I could find anything interesting without having a preconceived "project idea."
I noticed that there are many `FIXME` comments throughout the code, so I decided that finding a simple one to fix would be an excellent starting point.
FIXME
Many of the `FIXME`s I found were beyond my skill level or otherwise infeasible to tackle without a deeper knowledge of the codebase, so I eventually settled on one in the implementation of the `uniq` utility, which read: "The buffer does not automatically resize, and this will return EMSGSIZE if the read line is more than 1024 bytes." This seemed very doable.
The `read_line` method in `BufferedStream` takes a buffer from the user and attempts to fill it up to the newline character. If the line is too long to fit into the buffer, two calls to `read_line` are required, so the line is essentially treated as two lines. This causes `uniq` to misbehave: if two identical adjacent lines are longer than the buffer, they cannot be compared directly, since the first line gets partially read in and is then compared with the remainder of that same line. This results in both identical lines being printed out.
My first attempts ended with memory getting deallocated out from under me. The code made use of `StringView`s of the buffer passed to `read_line`, assuming that the buffer would not change. When I resized the buffer, a reallocation occurred, which invalidated all `StringView`s of that buffer.
I wasn't sure of the best way to proceed, so I asked for help on the official Discord server. After some discussion, it was decided that we could add a `read_line_with_resize` method to `BufferedStream` that automatically resizes the user-supplied buffer when a line does not fit. Although adding a method to a core library might not always be wise, it made sense here: many utilities require this functionality, and it would be a waste of effort to duplicate the same detect-overflow/resize/re-read logic everywhere it is needed.
As a side note, some of the utilities get around this issue by using the libc function `getline`, which reallocates the supplied buffer if it is too small.
BufferedStream
Implementing `read_line_with_resize` required some planning for how to handle the different scenarios that could arise. The trickiest case is when the internal buffer of the `BufferedStream` cannot fit the line: the line is read into the user-supplied buffer in chunks, where each chunk is at most the size of the internal buffer, and the buffer is resized as needed. The heavy lifting happens in `read_until_any_of_with_resize` (`read_line_with_resize` calls this method eventually, which will handle any delimiter):
```cpp
template<size_t N>
ErrorOr<Bytes> read_until_any_of_with_resize(ByteBuffer& buffer, Array<StringView, N> candidates)
{
    if (!stream().is_open())
        return Error::from_errno(ENOTCONN);

    auto candidate = TRY(find_and_populate_until_any_of(candidates));

    size_t bytes_read_to_user_buffer = 0;
    while (!candidate.has_value()) {
        if (m_buffer.used_space() == 0 && stream().is_eof()) {
            // If we read to the very end of the buffered and unbuffered data,
            // then treat the remainder as a full line (the last one), even if it
            // doesn't end in the delimiter.
            return buffer.span().trim(bytes_read_to_user_buffer);
        }

        if (buffer.size() - bytes_read_to_user_buffer < m_buffer.used_space()) {
            // Resize the user supplied buffer because it cannot fit
            // the contents of m_buffer.
            TRY(buffer.try_resize(buffer.size() + m_buffer.used_space()));
        }

        // Read bytes into the buffer starting from the offset of how many
        // bytes have previously been read.
        bytes_read_to_user_buffer += m_buffer.read(buffer.span().slice(bytes_read_to_user_buffer)).size();
        candidate = TRY(find_and_populate_until_any_of(candidates));
    }

    // Once the candidate has been found, read the contents of m_buffer into the buffer,
    // offset by how many bytes have already been read in.
    TRY(buffer.try_resize(bytes_read_to_user_buffer + candidate->offset));
    m_buffer.read(buffer.span().slice(bytes_read_to_user_buffer));
    TRY(m_buffer.discard(candidate->size));
    return buffer.span();
}
```
One thing to note about this code is the use of `ErrorOr` and `TRY`, which are explained in the documentation.
A major issue I had at first was testing: since `BufferedStream` is included in many other files, modifying it triggered a fairly gnarly recompilation that took around 20-30 minutes on my laptop. Thankfully, the Serenity developers point out that certain subsystems can be run on the host without having to build the entire OS. After writing tests covering each of the scenarios outlined above, I was able to get a working implementation much faster. The full pull request can be found here.
uniq
With `read_line_with_resize` implemented, I could now address that `FIXME` in `uniq`. The specification for `uniq` describes its expected behavior as follows: "The uniq utility shall read an input file comparing adjacent lines, and write one copy of each input line on the output. The second and succeeding copies of repeated adjacent input lines shall not be written."
This was implemented by maintaining two buffers, one for the current line and one for the previous line: on each iteration the buffers are swapped and a new line is read into the current buffer.
The main change that needed to be made after replacing `read_line` with `read_line_with_resize` was to no longer swap the `StringView`s, as they could be invalidated after a resize, and instead declare a new `StringView` of the buffer as follows:
```cpp
swap(current_buf, previous_buf);
// The StringViews cannot be swapped since read_line_with_resize
// potentially changes the location of the buffers due to reallocation.
// Instead create a new StringView of what was most recently read in.
previous = StringView { previous_buf.span().trim(current.length()) };
```
I started writing tests for these changes and noticed some incongruencies between the output of this program and that of the coreutils version. It turned out that the last of a run of repeated lines was being output rather than the first.
Although this isn't apparent when comparing whole lines, when using the `-f` or `-s` flags to limit the comparison to some offset from the beginning of the line, it becomes clear that the wrong line is being output. Fixing this was a matter of adding an inner loop that skips over lines while they match the current line.
Lastly, I fixed a couple of off-by-one errors that were affecting the `-c` and `-f` flags. The full pull request can be found here[2].
1. Thanks to timschumi for his help on this.
2. Thanks to tcl3 for reviewing this PR.