Thrift Bug


One of the toughest bugs to fix I’ve run into had to do with Apache Thrift’s golang library. I thought it would be fun to dive into how the bug got spotted, how I finally managed to reproduce it, and how I finally managed to fix it (or well, make it easier to handle).

Background

As part of my job at PayPal, I created a library to interact with Apache Aurora which I named gorealis.

gorealis was a stripped down re-implementation of the Python client that was already available as part of the Apache Aurora project. Unlike the fancy DSL that Aurora’s default client provided, gorealis usually stuck to presenting the Aurora Scheduler’s Thrift API directly for consumption by users.

As part of creating the library, I used the tooling provided by the Thrift project to generate the necessary golang bindings to make Thrift calls.

This worked pretty well (aside from when the project decided to change the code generator to include context as part of all function arguments from one minor revision to another).

Along with the code generator, there is also a Thrift golang library that the generated code uses to handle making the API calls.

So all in all a golang project that implements a Thrift API will typically need to use the code generator to create bindings and the Thrift golang library to make the API calls.

The Bug

When my library started to see heavier use (I’m proud to say it continues to power the infrastructure at PayPal to this very day), the code would randomly panic:

panic(0x1446380, 0x1fcff90)
 	/usr/local/go/src/runtime/panic.go:491 +0x283
 bytes.(*Buffer).Write(0x0, 0xc42143b000, 0x87, 0x1000, 0x1, 0x0, 0x0)
 	/usr/local/go/src/bytes/buffer.go:160 +0x26
 github.company.com/project/internal-project-prod/vendor/github.com/apache/thrift/lib/go/thrift.(*THttpClient).Write(0xc421418b00, 0xc42143b000, 0x87, 0x1000, 0xc421018538, 0x9195a8, 0xc42085ae10)
 	/go/src/github.company.com/project/internal-project-prod/vendor/github.com/apache/thrift/lib/go/thrift/http_client.go:173 +0x4d
 bufio.(*Writer).Flush(0xc421418b80, 0x90f42e, 0xc42085ae10)
 	/usr/local/go/src/bufio/bufio.go:567 +0x7e
 github.company.com/project/internal-project-prod/vendor/github.com/apache/thrift/lib/go/thrift.(*TJSONProtocol).Flush(0xc4203562d0, 0x1f8d4c0, 0xc420016088, 0x0, 0x0)
 	/go/src/github.company.com/project/internal-project-prod/vendor/github.com/apache/thrift/lib/go/thrift/json_protocol.go:440 +0x36
 github.company.com/project/internal-project-prod/vendor/github.com/apache/thrift/lib/go/thrift.(*TStandardClient).Send(0xc420e55020, 0x1f8d4c0, 0xc420016088, 0x1f9b720, 0xc4203562d0, 0x11, 0x1655325, 0xe, 0x1f7f500, 0xc420360020, ...)

The users of my library could not reproduce the bug.

I couldn’t reproduce the bug.

No one could reproduce the bug.

The code was crashing pretty hard and there was no rhyme or reason to when it would hit this corner case.

It was clear something was causing the Thrift Client to close itself but it was unclear where that was happening. Even more unclear was why this was resulting in a panic.

Up to this point, I had assumed that, I, the user, was using the library wrong. After all, we were the only ones running into this issue so it couldn’t have been the Thrift golang library…right?

I tried to upgrade to latest version in hopes of fixing the issue.

I tried to change around the code that detected a network error and recreated the Thrift client.

I tried to come up with scenarios in which gorealis was accidentally closing the Thrift Client. It had to be a concurrency issue on gorealis’ side, it just had to.

But every time I walked away empty handed.

It was shot in the dark after shot in the dark trying to figure out how this bug kept happening so it could be handled gracefully.

It was starting to look like the only option was going to be to use recover() and I really didn’t want to resort to doing that.

The bug wasn’t happening with the frequency that would make a fix urgent and my users were able to work around the bug for the time being, so I kept an item on my to-do list to get back and figure out a proper fix.

Finding the bug

Eventually, I found some time to dive deep into the code and try to re-create the panic.

I continued to believe that error had to be on the gorealis side. But after several unsuccessful attempts to fix the issue, I started to question where the bug might lie.

What if I was wrong to think it was an issue on the implementation side?

What if the library had a bug in it instead?

I started out by tracking back through the stack given by the panic’s error message (which itself was hard to obtain since it went away after my customers noticed it but before I got to see it).

Starting at the top of the stack, I looked through the code for TStandardClient.Send(). Nothing seemed out of place. Next, I looked at TJSONProtocol.Flush(). Again, nothing.

And then I got to the last function (Write) called on a struct of the type THttpClient. Combing through the code here I noticed something about Close():

func (p *THttpClient) Close() error {
	if p.requestBuffer != nil {
		p.requestBuffer.Reset()
		p.requestBuffer = nil
	}
	return p.closeResponse()
}

This didn’t seem interesting by itself but I wondered:

What would happen if something tried writing to an Thrift Client that was already closed?

func (p *THttpClient) Write(buf []byte) (int, error) {
	return p.requestBuffer.Write(buf)
}

You guessed it: a panic!

I now had a theory worth testing.

Verifying the bug

I set out to verify that this was indeed the bug that was plaguing my library and causing it to panic.

So I created a Thrift client in gorealis and then immediately closed it before making a call:

r, err := realis.NewRealisClient(realis.SchedulerUrl(auroraURL))

require.NoError(t, err)
r.Close()

role := "all"
_, err = r.GetTasksWithoutConfigs(&aurora.TaskQuery{Role: &role})

And…

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x4d58fd]

goroutine 36 [running]:
testing.tRunner.func1.2({0x76f440, 0xa5fd80})
        /home/rdelvalle/sdk/go1.20.1/src/testing/testing.go:1526 +0x24e
testing.tRunner.func1()
        /home/rdelvalle/sdk/go1.20.1/src/testing/testing.go:1529 +0x39f
panic({0x76f440, 0xa5fd80})
        /home/rdelvalle/sdk/go1.20.1/src/runtime/panic.go:884 +0x213
bytes.(*Buffer).Write(0xc0001260e5?, {0xc0001a4000?, 0x0?, 0x1?})
        /home/rdelvalle/sdk/go1.20.1/src/bytes/buffer.go:168 +0x1d
github.com/apache/thrift/lib/go/thrift.(*THttpClient).Write(0x6ab2fc?, {0xc0001a4000?, 0xc0001005c8?, 0xc000058aa0?})
        /home/rdelvalle/go/pkg/mod/github.com/apache/thrift@v0.13.0/lib/go/thrift/http_client.go:173 +0x25
bufio.(*Writer).Flush(0xc00015ac40)
        /home/rdelvalle/sdk/go1.20.1/src/bufio/bufio.go:628 +0x62
github.com/apache/thrift/lib/go/thrift.(*TJSONProtocol).Flush(0xc00011a188, {0x85cb48, 0xc000126008})
        /home/rdelvalle/go/pkg/mod/github.com/apache/thrift@v0.13.0/lib/go/thrift/json_protocol.go:440 +0x34
github.com/apache/thrift/lib/go/thrift.(*TStandardClient).Send(0x7f191da95a68?, {0x85cb48, 0xc000126008}, {0x860d38?, 0xc00011a188?}, 0x58bb8?, {0x7d527f?, 0x8?}, {0x85ba00, 0xc00011a1c8})
        /home/rdelvalle/go/pkg/mod/github.com/apache/thrift@v0.13.0/lib/go/thrift/client.go:46 +0x17e
github.com/apache/thrift/lib/go/thrift.(*TStandardClient).Call(0xc000117860, {0x85cb48?, 0xc000126008?}, {0x7d527f, 0x16}, {0x85ba00?, 0xc00011a1c8?}, {0x85ba28?, 0xc00011a1d0})
        /home/rdelvalle/go/pkg/mod/github.com/apache/thrift@v0.13.0/lib/go/thrift/client.go:85 +0x91
github.com/paypal/gorealis/gen-go/apache/aurora.(*ReadOnlySchedulerClient).GetTasksWithoutConfigs(0xc000112ae0, {0x85cb48, 0xc000126008}, 0xc00012efa0)
        /home/rdelvalle/git/gorealis/gen-go/apache/aurora/auroraAPI.go:17525 +0xb8
github.com/paypal/gorealis.(*realisClient).GetTasksWithoutConfigs.func1()
        /home/rdelvalle/git/gorealis/realis.go:983 +0x36
github.com/paypal/gorealis.(*realisClient).thriftCallWithRetries.func1(0xc00011c7e0, 0xc00014bda0, 0xc00014be38, 0x9cdb10?)
        /home/rdelvalle/git/gorealis/retry.go:155 +0xaa
github.com/paypal/gorealis.(*realisClient).thriftCallWithRetries(0xc00011c7e0, 0x0, 0x7f18f4f00848?)
        /home/rdelvalle/git/gorealis/retry.go:158 +0x270
github.com/paypal/gorealis.(*realisClient).GetTasksWithoutConfigs(0xc00011c7e0, 0xc00012efa0)
        /home/rdelvalle/git/gorealis/realis.go:980 +0x11a
github.com/paypal/gorealis_test.TestThriftJSON(0xc000106d00?)
        /home/rdelvalle/git/gorealis/realis_e2e_test.go:157 +0x1cb

Bingo. The panic was now reproducible.

The Solution

For reasons unknown to me to this day (I didn’t have that much time to dive into this bug), some piece of code was closing the Thrift client out of turn and unexpectedly.

From weeks spent trying to debug this issue, it didn’t seem like it was something gorealis was doing so I have come to the conclusion that the Thrift golang library randomly closes the Thrift API client after some amount of time or after encountering an internal error.

I never did find the source of it (that is, what was closing the client in the first place), but I was able to add some sanity check that would allow users of the golang Thrift library to handle the error gracefully (and maybe more importantly, idiomatically):

func (p *THttpClient) Write(buf []byte) (int, error) {
	if p.requestBuffer == nil {
		return 0, NewTTransportException(NOT_OPEN, "Request buffer is nil, connection may have been closed.")
	}
	return p.requestBuffer.Write(buf)
}

func (p *THttpClient) WriteByte(c byte) error {
	if p.requestBuffer == nil {
		return NewTTransportException(NOT_OPEN, "Request buffer is nil, connection may have been closed.")
	}
	return p.requestBuffer.WriteByte(c)
}

func (p *THttpClient) WriteString(s string) (n int, err error) {
	if p.requestBuffer == nil {
		return 0, NewTTransportException(NOT_OPEN, "Request buffer is nil, connection may have been closed.")
	}
	return p.requestBuffer.WriteString(s)
}

So I filed a bug on Apache Thrift’s JIRA prepared the patch and sent it in for review.

Patch Merged

The patch was eventually accepted by one of the maintainers and we could finally get back to using the upstream Thrift golang library on gorealis when version v0.0.14 was released.

It’s still a mystery why the client was getting closed, but now, instead of burning down the house, it just results in a good old fashioned, run of the mill error.

It’s also worth reflecting on just how much work debugging is, how difficult it often is to debug an issue, and often, how simple it is to fix a bug as a result of the hard work done in the debugging stage.

I personally feel it’s something that’s not often acknowledged in the Open Source Community at large and it should be highlighted way more.

So if you use the Thrift golang library and you ever see the error:

Request buffer is nil, connection may have been closed.

salute the many unsung heroes that put in the hours or weeks debugging hard to reproduce bugs in Open Source projects on my behalf and celebrate the fact that open source is pretty awesome sometimes.

This post was made better with valuable feedback from