struct printk_ringbuffer
------------------------
John Ogness

Overview
~~~~~~~~
As the name suggests, this ring buffer was implemented specifically to serve
the needs of the printk() infrastructure. The ring buffer itself is not
specific to printk and could be used for other purposes. _However_, the
requirements and semantics of printk are rather unique. If you intend to use
this ring buffer for anything other than printk, you need to be very clear
on its features, behavior, and pitfalls.

Features
^^^^^^^^
The printk ring buffer has the following features:

- single global buffer
- resides in initialized data section (available at early boot)
- lockless readers
- supports multiple writers
- supports multiple non-consuming readers
- safe from any context (including NMI)
- groups bytes into variable length blocks (referenced by entries)
- entries tagged with sequence numbers

Behavior
^^^^^^^^
Since the printk ring buffer readers are lockless, there exists no
synchronization between readers and writers. Essentially, writers are the
tasks in control and may overwrite any and all committed data at any time
and from any context. For this reason readers can miss entries if they are
overwritten before the reader was able to access the data. The reader API
implementation is such that reader access to entries is atomic, so there is
no risk of readers having to deal with partial or corrupt data. Also,
entries are tagged with sequence numbers, so readers can recognize if
entries were missed.

Writing to the ring buffer consists of 2 steps. First a writer must reserve
an entry of desired size. After this step the writer has exclusive access
to the memory region. Once the data has been written to memory, it needs to
be committed to the ring buffer. After this step the entry has been inserted
into the ring buffer and assigned an appropriate sequence number.

Once committed, a writer must no longer access the data directly. This is
because the data may have been overwritten and no longer exists. If a
writer must access the data, it should either keep a private copy before
committing the entry or use the reader API to gain access to the data.

Because of how the data backend is implemented, entries that have been
reserved but not yet committed act as barriers, preventing future writers
from filling the ring buffer beyond the location of the reserved but not
yet committed entry region. For this reason it is *important* that writers
perform both reserve and commit as quickly as possible. Also, be aware that
preemption and local interrupts are disabled and writing to the ring buffer
is processor-reentrant locked during the reserve/commit window. Writers in
NMI contexts can still preempt any other writers, but as long as these
writers do not write a large amount of data with respect to the ring buffer
size, this should not become an issue.

API
~~~

Declaration
^^^^^^^^^^^
The printk ring buffer can be instantiated as a static structure:

  /* declare a static struct prb_cpulock */
  #define DECLARE_STATIC_PRINTKRB_CPULOCK(name)

  /* declare a static struct printk_ringbuffer */
  #define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr)

The value of szbits specifies the size of the ring buffer as a power of 2
(2^szbits bytes). The cpulockptr field is a pointer to a prb_cpulock struct
that is used to perform processor-reentrant spin locking for the writers.
It is specified externally because it may be used by multiple ring buffers
(or other code) to synchronize writers without risk of deadlock.

Here is an example of a declaration of a printk ring buffer specifying a
32KB (2^15) ring buffer:

....
DECLARE_STATIC_PRINTKRB_CPULOCK(rb_cpulock);
DECLARE_STATIC_PRINTKRB(rb, 15, &rb_cpulock);
....

If writers will be using multiple ring buffers and the ordering of that
usage is not clear, the same prb_cpulock should be used for all of the ring
buffers.

Writer API
^^^^^^^^^^
The writer API consists of 2 functions. The first reserves an entry in the
ring buffer, the second commits that data to the ring buffer. The reserved
entry information is stored within a provided `struct prb_handle`.

  /* reserve an entry */
  char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
                    unsigned int size);

  /* commit a reserved entry to the ring buffer */
  void prb_commit(struct prb_handle *h);

Here is an example of a function to write data to a ring buffer:

....
int write_data(struct printk_ringbuffer *rb, char *data, int size)
{
        struct prb_handle h;
        char *buf;

        buf = prb_reserve(&h, rb, size);
        if (!buf)
                return -1; /* reserve failed, see prb_inc_lost() below */
        memcpy(buf, data, size);
        prb_commit(&h);

        return 0;
}
....

Pitfalls
++++++++
Be aware that prb_reserve() can fail. A retry might be successful, but it
depends entirely on whether or not the next part of the ring buffer to
overwrite belongs to reserved but not yet committed entries of other
writers. Writers can use the prb_inc_lost() function to allow readers to
notice that a message was lost.

Reader API
^^^^^^^^^^
The reader API utilizes a `struct prb_iterator` to track the reader's
position in the ring buffer.

  /* declare a pre-initialized static iterator for a ring buffer */
  #define DECLARE_STATIC_PRINTKRB_ITER(name, rbaddr)

  /* initialize iterator for a ring buffer (if static macro NOT used) */
  void prb_iter_init(struct prb_iterator *iter,
                     struct printk_ringbuffer *rb, u64 *seq);

  /* make a deep copy of an iterator */
  void prb_iter_copy(struct prb_iterator *dest, struct prb_iterator *src);

  /* non-blocking, advance to next entry (and read the data) */
  int prb_iter_next(struct prb_iterator *iter, char *buf, int size,
                    u64 *seq);

  /* blocking, advance to next entry (and read the data) */
  int prb_iter_wait_next(struct prb_iterator *iter, char *buf, int size,
                         u64 *seq);

  /* position iterator at the entry seq */
  int prb_iter_seek(struct prb_iterator *iter, u64 seq);

  /* read data at current position */
  int prb_iter_data(struct prb_iterator *iter, char *buf, int size,
                    u64 *seq);

Typically prb_iter_data() is not needed because the data can be retrieved
directly with prb_iter_next().

Here is an example of a non-blocking function that will read all the data
in a ring buffer:

....
void read_all_data(struct printk_ringbuffer *rb, char *buf, int size)
{
        struct prb_iterator iter;
        u64 prev_seq = 0;
        u64 seq;
        int ret;

        prb_iter_init(&iter, rb, NULL);

        for (;;) {
                ret = prb_iter_next(&iter, buf, size, &seq);
                if (ret > 0) {
                        if (seq != ++prev_seq) {
                                /* "seq - prev_seq" entries missed */
                                prev_seq = seq;
                        }
                        /* process buf here */
                } else if (ret == 0) {
                        /* hit the end, done */
                        break;
                } else if (ret < 0) {
                        /*
                         * iterator is invalid, a writer overtook us,
                         * reset the iterator and keep going, entries
                         * were missed
                         */
                        prb_iter_init(&iter, rb, NULL);
                }
        }
}
....

Pitfalls
++++++++
The reader's iterator can become invalid at any time because the reader was
overtaken by a writer. Typically the reader should reset the iterator back
to the current oldest entry (which will be newer than the entry the reader
was at) and continue, noting the number of entries that were missed.
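Here is a sketch of how a blocking reader might combine prb_iter_wait_next()
with such a reset. It assumes, beyond what the prototypes above guarantee,
that prb_iter_wait_next() follows the same return convention as
prb_iter_next(): a positive return value means an entry was read and a
negative return value means the iterator has become invalid.

....
void consume_data(struct printk_ringbuffer *rb, char *buf, int size)
{
        struct prb_iterator iter;
        u64 seq;
        int ret;

        prb_iter_init(&iter, rb, NULL);

        for (;;) {
                /* block until the next entry is available */
                ret = prb_iter_wait_next(&iter, buf, size, &seq);
                if (ret > 0) {
                        /* process buf here */
                } else if (ret < 0) {
                        /*
                         * A writer overtook the iterator. Reset it to
                         * the current oldest entry and continue;
                         * entries were missed.
                         */
                        prb_iter_init(&iter, rb, NULL);
                }
        }
}
....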
Utility API
^^^^^^^^^^^
Several convenience functions are available for external code.

  /* query the size of the data buffer */
  int prb_buffer_size(struct printk_ringbuffer *rb);

  /* skip a seq number to signify a lost record */
  void prb_inc_lost(struct printk_ringbuffer *rb);

  /* processor-reentrant spin lock */
  void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);

  /* processor-reentrant spin unlock */
  void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);

Pitfalls
++++++++
Although the value returned by prb_buffer_size() represents an absolute
upper bound, the amount of data that can be stored within the ring buffer
is actually less because of the additional storage space needed by the
header of each entry.

The prb_lock() and prb_unlock() functions can be used to synchronize
between ring buffer writers and other external activities. The function of
a processor-reentrant spin lock is to disable preemption and local
interrupts and synchronize against other processors. It does *not* protect
against multiple contexts of a single processor, i.e. NMI.
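Here is a minimal sketch of such external synchronization, using the
rb_cpulock from the declaration example above. It assumes that the
cpu_store argument only carries lock state between the matching prb_lock()
and prb_unlock() calls:

....
void external_activity(void)
{
        unsigned int cpu_store;

        prb_lock(&rb_cpulock, &cpu_store);
        /*
         * All writers using rb_cpulock are excluded here, except for
         * writers in NMI context on this processor.
         */
        prb_unlock(&rb_cpulock, &cpu_store);
}
....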
Implementation
~~~~~~~~~~~~~~
This section describes several of the implementation concepts and details
to help developers better understand the code.

Entries
^^^^^^^
All ring buffer data is stored within a single static byte array. This
ensures that any pointers to the data (past and present) will always point
to valid memory. This is important because the lockless readers may be
preempted for long periods of time and when they resume may be working with
expired pointers.

Entries are identified by start index and size. (The start index plus size
is the start index of the next entry.) The start index is not simply an
offset into the byte array, but rather a logical position (lpos) that maps
directly to byte array offsets. For example, for a byte array of size 1000,
an entry may have a start index of 100. Another entry may have a start
index of 1100. And yet another 2100. All of these entries point to the same
memory region, but only the most recent entry is valid. The other entries
point to valid memory, but represent entries that have been overwritten.

Note that due to overflowing, the most recent entry is not necessarily the
one with the highest lpos value. Indeed, the printk ring buffer initializes
its data such that an overflow happens relatively quickly in order to
validate the handling of this situation. The implementation assumes that an
lpos (unsigned long) will never completely wrap while a reader is
preempted. If this were to become an issue, the seq number (which never
wraps) could be used to increase the robustness of handling this situation.

Buffer Wrapping
^^^^^^^^^^^^^^^
If an entry starts near the end of the byte array but would extend beyond
it, a special terminating entry (size = -1) is inserted into the byte array
and the real entry is placed at the beginning of the byte array. This can
waste space at the end of the byte array, but simplifies the implementation
by allowing writers to always work with contiguous buffers.

Note that the size field is the first 4 bytes of the entry header. Also
note that calc_next() always ensures that there are at least 4 bytes left
at the end of the byte array to allow room for a terminating entry.

Ring Buffer Pointers
^^^^^^^^^^^^^^^^^^^^
Three pointers (lpos values) are used to manage the ring buffer:

- _tail_: points to the oldest entry
- _head_: points to where the next new committed entry will be
- _reserve_: points to where the next new reserved entry will be

These pointers always maintain a logical ordering:

  tail <= head <= reserve

The reserve pointer moves forward when a writer reserves a new entry. The
head pointer moves forward when a writer commits a new entry.

The reserve pointer cannot overwrite the tail pointer in a wrap situation.
In such a situation, the tail pointer must be "pushed forward", thus
invalidating that oldest entry. Readers identify if they are accessing a
valid entry by ensuring their entry pointer is `>= tail && < head`.

If the tail pointer is equal to the head pointer, it cannot be pushed and
any reserve operation will fail. The only resolution is for writers to
commit their reserved entries.

Processor-Reentrant Locking
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The purpose of the processor-reentrant locking is to limit the interruption
scenarios of writers to 2 contexts. This allows for a simplified
implementation where:

- The reserve/commit window only exists on 1 processor at a time. A reserve
  can never fail due to uncommitted entries of other processors.

- When committing entries, it is trivial to handle the situation when
  subsequent entries have already been committed, i.e. managing the head
  pointer.
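To tie together the Entries and Ring Buffer Pointers concepts above, here
is an illustrative sketch (not code taken from the implementation) of how
logical positions map to byte array offsets and how the reader validity
check can be expressed, for a ring buffer of size 2^PRB_SIZE_BITS. The
names are hypothetical; the unsigned subtraction form implements the
`>= tail && < head` check in a way that remains correct even if the lpos
values overflow:

....
#define PRB_SIZE_BITS 15
#define PRB_SIZE (1UL << PRB_SIZE_BITS)

/* map a logical position to an offset within the byte array */
static unsigned long lpos_to_offset(unsigned long lpos)
{
        return lpos & (PRB_SIZE - 1);
}

/*
 * An entry is valid only if it has not been overwritten:
 * tail <= lpos < head.
 */
static bool entry_is_valid(unsigned long lpos,
                           unsigned long tail, unsigned long head)
{
        return (lpos - tail) < (head - tail);
}
....

For example, with PRB_SIZE_BITS = 15 (the 32KB declaration example above),
the lpos values 100, 32868, and 65636 all map to byte array offset 100,
just as the start indexes 100, 1100, and 2100 share an offset in the byte
array of size 1000 described above.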
Performance
~~~~~~~~~~~
Some basic tests were performed on a quad Intel(R) Xeon(R) CPU E5-2697 v4
at 2.30GHz (36 cores / 72 threads). All tests involved writing a total of
32,000,000 records at an average of 33 bytes each. Each writer was pinned
to its own CPU and would write as fast as it could until a total of
32,000,000 records were written. All tests involved 2 readers that were
both pinned together to another CPU. Each reader would read as fast as it
could and track how many of the 32,000,000 records it could read. All
tests used a ring buffer of 16KB in size, which holds around 350 records
(header + data for each entry).

The only difference between the tests is the number of writers (and thus
also the number of records per writer). As more writers are added, the
time to write a record increases. This is because the data pointers,
modified via cmpxchg, and global data access in general become more
contended.

1 writer
^^^^^^^^
 runtime: 0m 18s
 reader1: 16219900/32000000 (50%) records
 reader2: 16141582/32000000 (50%) records

2 writers
^^^^^^^^^
 runtime: 0m 32s
 reader1: 16327957/32000000 (51%) records
 reader2: 16313988/32000000 (50%) records

4 writers
^^^^^^^^^
 runtime: 0m 42s
 reader1: 16421642/32000000 (51%) records
 reader2: 16417224/32000000 (51%) records

8 writers
^^^^^^^^^
 runtime: 0m 43s
 reader1: 16418300/32000000 (51%) records
 reader2: 16432222/32000000 (51%) records

16 writers
^^^^^^^^^^
 runtime: 0m 54s
 reader1: 16539189/32000000 (51%) records
 reader2: 16542711/32000000 (51%) records

32 writers
^^^^^^^^^^
 runtime: 1m 13s
 reader1: 16731808/32000000 (52%) records
 reader2: 16735119/32000000 (52%) records

Comments
^^^^^^^^
It is particularly interesting to compare/contrast the 1-writer and
32-writer tests. Despite the writing of the 32,000,000 records taking over
4 times longer, the readers (which perform no cmpxchg) were still unable
to keep up. This shows that the memory contention between the increasing
number of CPUs also has a dramatic effect on readers.

It should also be noted that in all cases each reader was able to read
>=50% of the records. This means that a single reader would have been able
to keep up with the writer(s) in all cases, becoming slightly easier as
more writers are added. This was the purpose of pinning 2 readers to 1
CPU: to observe how maximum reader performance changes.