Key questions

Not all data is what it seems to be

JavaScript

20 October 2022

There is an ancient Mayan saying that computers can solve a lot of problems that we wouldn't have to solve without them. Today, we can sink our teeth into a problem just like that. It pairs really well with Wat, a lightning talk from 2012.

My own Wat moment started with the Buffer class. Let's create two of them right away.

$ node
> data1 = Buffer.from([0xf5, 0xcf, 0xe2, 0xf0, 0xef])
<Buffer f5 cf e2 f0 ef>
> data2 = Buffer.from([0xfe, 0x99, 0x88, 0xeb, 0xd9])
<Buffer fe 99 88 eb d9>

It's clearly visible to the naked eye that these are indeed two different buffers, but Node.js can also confirm it for us:

> data1 === data2
false

That's all nice and shiny, but let's look at another example.

> container = {}
{}
> container[data1] = 'foo'
'foo'
> container[data2]
???

What will be the value of the last expression?

a) null
b) undefined
c) it creates a black hole in place of the node interpreter
d) nothing

Maybe a lot of people would go with b. Maybe someone who knows Node.js a bit better would pick c. But the right answer is so terrible that it's not even an option.

> container[data2]
'foo'

What happens behind the scenes? An object's property key can only be a string (or a symbol), so a Buffer used as a key is converted automatically by calling its toString method. For a Buffer, toString takes an optional encoding parameter, and if it doesn't get one, it goes with utf8 by default.
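The conversion described above can be observed directly. This is a minimal sketch that reuses the first buffer from the example; the variable names storedKey, keyIsString and sameAsExplicitUtf8 are only illustrative.

```javascript
// Buffer is a global in Node.js, so no import is needed.
const data1 = Buffer.from([0xf5, 0xcf, 0xe2, 0xf0, 0xef]);

const container = {};
container[data1] = 'foo';

// The Buffer object itself is not stored as the key; its string form is.
const storedKey = Object.keys(container)[0];
const keyIsString = typeof storedKey === 'string';

// The implicit conversion gives the same result as an explicit utf8 decode.
const sameAsExplicitUtf8 = storedKey === data1.toString('utf8');

console.log(keyIsString, sameAsExplicitUtf8); // true true
```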

Our good-looking byte array knows nothing about behaving like a well-formed UTF-8 string (that's why it's in our example), so every one of its bytes is replaced with the Unicode replacement character (U+FFFD), which looks like this: �.

Both of our buffers are ignorant in this regard, so at the end of the conversion both strings contain nothing but five replacement characters.

> data1.toString() === data2.toString()
true
> container
{ '�����': 'foo' }
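To make the loss of information more tangible, here is a rough sketch that counts the replacement characters, tries to turn the decoded string back into bytes, and checks the input with a strict decoder. The helpers countReplacements and isValidUtf8 are hypothetical names made up for this illustration, and the strict check assumes a standard Node.js build where the global TextDecoder supports the fatal option.

```javascript
const data1 = Buffer.from([0xf5, 0xcf, 0xe2, 0xf0, 0xef]);
const data2 = Buffer.from([0xfe, 0x99, 0x88, 0xeb, 0xd9]);

const text1 = data1.toString('utf8');
const text2 = data2.toString('utf8');

// Hypothetical helper: counts U+FFFD characters in a string.
const countReplacements = (s) =>
  [...s].filter((ch) => ch === '\ufffd').length;

// Re-encoding the decoded text does not bring the original bytes back:
// each U+FFFD is written out as the three bytes ef bf bd.
const roundTrip = Buffer.from(text1, 'utf8');
const survivedRoundTrip = roundTrip.equals(data1);

// Hypothetical helper: a strict decoder throws on malformed input
// instead of silently substituting U+FFFD.
function isValidUtf8(buf) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(buf);
    return true;
  } catch {
    return false;
  }
}

console.log(countReplacements(text1), countReplacements(text2));
console.log(text1 === text2, survivedRoundTrip);
console.log(isValidUtf8(data1), isValidUtf8(Buffer.from('foo')));
```

A check like isValidUtf8 could serve as a guard before treating arbitrary bytes as text, but it only detects the problem; it does not make the buffer safe to use as a key.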

After all this, it makes sense that we get back the value stored for the first buffer when we use the second one as the key. Now imagine this situation deep down in an in-memory cache layer, where the only symptom you see is that sometimes, maybe once in a hundred thousand cases, the data from the cache is wrong. It's a really fun experience.

What could we do about this? We are probably better off not using a Buffer as a key, but if we really need to, we can call toString with a different encoding parameter. All of the examples below would work in this case:

> data1.toString('hex') === data2.toString('hex')
false
> data1.toString('base64') === data2.toString('base64')
false
> data1.toString('binary') === data2.toString('binary')
false
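Applied to the cache scenario, the idea could look something like the sketch below. BufferKeyedCache and keyFor are hypothetical names invented for this illustration, not part of Node.js, and hex is just one of the lossless encodings shown above.

```javascript
const data1 = Buffer.from([0xf5, 0xcf, 0xe2, 0xf0, 0xef]);
const data2 = Buffer.from([0xfe, 0x99, 0x88, 0xeb, 0xd9]);

// Hypothetical in-memory cache that derives its keys from Buffer contents.
class BufferKeyedCache {
  constructor() {
    this.entries = new Map();
  }

  // hex is lossless: every byte maps to exactly two hex digits,
  // so different byte sequences always produce different keys.
  static keyFor(buf) {
    return buf.toString('hex');
  }

  set(buf, value) {
    this.entries.set(BufferKeyedCache.keyFor(buf), value);
  }

  get(buf) {
    return this.entries.get(BufferKeyedCache.keyFor(buf));
  }
}

const cache = new BufferKeyedCache();
cache.set(data1, 'foo');

// Same bytes in a different Buffer object: still a hit.
const hit = cache.get(Buffer.from([0xf5, 0xcf, 0xe2, 0xf0, 0xef]));
// Different bytes: a miss, as it should be.
const miss = cache.get(data2);

console.log(hit, miss); // foo undefined
```

Putting the Buffer objects straight into a Map would avoid the string conversion, but a Map compares object keys by identity, so two separate buffers with identical bytes would count as different keys. Deriving an explicit string key keeps the lookup based on content.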

deadlime
