One thing we learned while building the capture side of Vine for Android was how to deal with raw preview buffers in order to satisfy the stop-motion requirements. (According to Instagram, they were able to use the native MediaRecorder, with a 700ms+ delay on start and a minimum duration, but Vine can't afford that if it is to support stop motion.) And because we can't use MediaRecorder, we link in other libraries to do the encoding.
In order to use the raw buffers, setPreviewCallbackWithBuffer is used in place of setPreviewCallback, and addCallbackBuffer must be called to add a minimum number of frame buffers before or during preview. This way buffers are not allocated at run time, so there is no lag during recording (lag causes serious frame drops). For Vine, we take the frames and put them on a concurrent queue; another thread takes the buffers from the queue, processes each frame, and then hands the buffer back to Camera. So for a 6-second, 30fps video, a maximum of 180 frames is needed if the user records one single long clip. There lies the problem: each frame is about 1MB, so allocating 180 frames of raw bytes up front is really slow and likely to cause an OOM. Let's look at the iterations we went through to minimize the problem, as well as how we made everything else faster.
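The hand-off described above can be sketched in plain Java. This is only an illustration, not Vine's actual code: FrameBufferPool, onFrame and processOne are hypothetical names, and the camera is simulated. On Android, the "free" side would correspond to buffers handed to Camera.addCallbackBuffer and the "filled" side to frames arriving in onPreviewFrame.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for the camera/processor hand-off (assumed names). */
class FrameBufferPool {
    private final BlockingQueue<byte[]> free;   // buffers the camera may fill
    private final BlockingQueue<byte[]> filled; // frames waiting to be processed

    FrameBufferPool(int bufferCount, int frameSize) {
        free = new ArrayBlockingQueue<>(bufferCount);
        filled = new ArrayBlockingQueue<>(bufferCount);
        // Allocate everything up front so nothing is allocated while recording.
        for (int i = 0; i < bufferCount; i++) {
            free.add(new byte[frameSize]);
        }
    }

    /** Camera side: returns false (a dropped frame) when no free buffer is left. */
    boolean onFrame(byte firstByte) {
        byte[] buf = free.poll();
        if (buf == null) {
            return false;
        }
        buf[0] = firstByte; // stands in for the camera writing pixel data
        filled.add(buf);
        return true;
    }

    /** Processing side: take one frame, process it, and recycle its buffer. */
    boolean processOne() {
        byte[] buf = filled.poll();
        if (buf == null) {
            return false;
        }
        // Conversion, manipulation and encoding would happen here.
        free.add(buf);
        return true;
    }

    int freeCount() {
        return free.size();
    }
}
```

With a pool of 2 buffers, a third onFrame call before any processOne call returns false, which is the frame drop the pre-allocated pool is sized to avoid. The faster processOne runs relative to the camera, the fewer buffers the pool needs.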
Naive solution: Add 180 frames prior to startPreview, guaranteeing 180 frames on all phones. Do all the allocations and the initialization of classes and objects when the user starts recording.
Result: GC_ALLOC happens, OOM happens on some phones, and incremental heap growth pushes allocation time up to 10 - 30 seconds on certain phones. It takes 1 - 2 seconds before allocation starts.
The first thing I tried was to identify the bottlenecks during recording. Can processing be made faster so that we don't need that many frames?
Processing a frame really consists of four small steps, so it was not hard to time them.
(All times are relative to each other rather than real times, since real times vary by device.)
1. Convert an NV21 frame to a Bitmap for manipulation. (Time: 50x)
2. Do bitmap manipulation on the converted Bitmap. (Time: 5x)
3. Encode the bitmap. (Time: 20x)
4. Write to the container. (Time: 1x)
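Relative timings like the ones above can be collected with a small helper. The following is a minimal sketch under my own assumptions: StepTimer and its methods are hypothetical names, not part of Vine or of any Android API. It sums System.nanoTime deltas per step and reports each total as a multiple of the cheapest step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-step timer; reports totals relative to the cheapest step. */
class StepTimer {
    private final Map<String, Long> totals = new LinkedHashMap<>();

    /** Runs the work and adds its elapsed time to the named step. */
    void time(String step, Runnable work) {
        long start = System.nanoTime();
        work.run();
        record(step, System.nanoTime() - start);
    }

    /** Adds an already measured duration to the named step. */
    void record(String step, long nanos) {
        totals.merge(step, nanos, Long::sum);
    }

    /** Each step's total divided by the smallest total, so the cheapest step is 1.0. */
    Map<String, Double> relative() {
        long min = Long.MAX_VALUE;
        for (long t : totals.values()) {
            min = Math.min(min, Math.max(t, 1L)); // avoid dividing by zero
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            out.put(e.getKey(), (double) e.getValue() / min);
        }
        return out;
    }
}
```

Wrapping each of the four steps in time(...) over a few hundred frames and then reading relative() gives numbers in the same "50x / 5x / 20x / 1x" form, which stay comparable across devices even though the absolute times do not.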
1. If conversion in Java takes about 50x, can we do it better in native code, or is there a better solution? It turns out that if we do the color conversion on the GPU via an intrinsic RenderScript (a heavily optimized conversion script), we can go from 50x to 1x with just a few lines of code. Unfortunately, this is Android 4.2+ only at the time of writing, but a support library may come in the future to backport it to older Android devices.
2. The bitmap manipulations (rotation, clip, inversion) were all done separately; using a single Matrix cut the time from 5x to 2x.
3. Encoding: there isn't much we can do here, since the encoding algorithm is already optimized. If we used MediaCodec, the time would drop from 20x to 10x, but this is 4.1+ only and there is no sign that a support library will cover it in the future.
4. Writing to the container is super fast, so there is nothing to be done here for now.
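To see why step 1 is so expensive in Java, here is a sketch of a per-pixel NV21 to ARGB conversion. It is not the code Vine used (the post does not show it): the class name is hypothetical, and the integer BT.601 video-range coefficients are an assumption on my part. The point is that every pixel needs several multiplies, shifts and clamps, which is exactly the kind of work an intrinsic script can take over.

```java
/** Illustrative NV21 to ARGB conversion; names and coefficients are assumptions. */
class Nv21Converter {
    static int[] toArgb(byte[] nv21, int width, int height) {
        if (width % 2 != 0 || height % 2 != 0) {
            throw new IllegalArgumentException("width and height must be even");
        }
        int frameSize = width * height;
        if (nv21.length < frameSize * 3 / 2) {
            throw new IllegalArgumentException("buffer too small for an NV21 frame");
        }
        int[] argb = new int[frameSize];
        for (int y = 0; y < height; y++) {
            // The Y plane comes first; one interleaved VU row serves two Y rows.
            int uvRow = frameSize + (y / 2) * width;
            for (int x = 0; x < width; x++) {
                int c = Math.max((nv21[y * width + x] & 0xFF) - 16, 0);
                int uv = uvRow + (x & ~1);            // each VU pair covers two columns
                int e = (nv21[uv] & 0xFF) - 128;      // V comes first in NV21
                int d = (nv21[uv + 1] & 0xFF) - 128;  // then U
                int r = clamp((298 * c + 409 * e + 128) >> 8);
                int g = clamp((298 * c - 100 * d - 208 * e + 128) >> 8);
                int b = clamp((298 * c + 516 * d + 128) >> 8);
                argb[y * width + x] = 0xFF000000 | (r << 16) | (g << 8) | b;
            }
        }
        return argb;
    }

    private static int clamp(int v) {
        return v < 0 ? 0 : (v > 255 ? 255 : v);
    }
}
```

The resulting int[] is in the packed ARGB layout that a Bitmap can be built from. For a frame of width w and height h the loop body runs w * h times per frame, 30 times per second, on a single thread.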
What this did was let us cut the requirement from 180 frames to 140 on certain devices, and to 120 on 4.2+ devices. (We have a device profiling system for this.)
Improved processing solution: Add 140 frames prior to startPreview, guaranteeing 140 frames on all phones. Do all the allocations and the initialization of classes and objects when the user starts recording.
Result: GC_ALLOC happens less, OOM still happens on some phones but less often, and incremental heap growth pushes allocation time up to 5 seconds on certain phones before they can start recording. It takes 1 - 2 seconds before allocation starts. (The big improvement here happens because GC on the last 40 frames is usually the slowest.)
This is still unacceptable.
Improve allocation speed: Lying to get more memory is good.
Why does GC happen? Why does the heap even need to grow if we know how much we need?
GC happens when the allocated heap hits about 70% of capacity. And the heap grows in small increments because we only ask for a small number of bytes at a time.
It turns out that, right before adding the small buffers, I can add the following code to make it 100x faster:
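byte[] temp = new byte[(int) (140 * requiredSize * 1.5)]; // One big request grows the heap once.
temp[0] = 1; // Touch the array so the allocation is really made.
temp = null; // Explicit.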
This makes GC_ALLOC happen much, much less often (sometimes only once), and the heap no longer grows more than once.
Result: GC_ALLOC happens much less, OOM happens faster, and allocation time goes up to 2 seconds on certain phones before they can start recording. It takes 1 - 2 seconds before allocation starts.
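Put together with the small allocations that follow it, the trick might look like the sketch below. PreGrownBuffers and allocate are hypothetical names, and the overflow check is my addition. Whether the warm-up allocation helps at all depends on the runtime's heap growth policy: the effect described here was observed on the Android runtime of the time and may not carry over elsewhere.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: grow the heap once, then allocate the frame buffers. */
class PreGrownBuffers {
    static List<byte[]> allocate(int frameCount, int frameSize) {
        if (frameCount <= 0 || frameSize <= 0) {
            throw new IllegalArgumentException("frameCount and frameSize must be positive");
        }
        long total = (long) frameCount * frameSize;
        long warmUp = total + total / 2; // 1.5x of what we really need
        if (warmUp > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("warm-up array would exceed the maximum array size");
        }
        byte[] temp = new byte[(int) warmUp]; // one big request instead of many small ones
        temp[0] = 1;                          // touch it so the allocation is really made
        temp = null;                          // explicit: let the GC reclaim it

        // The real buffers now fit into the space the heap has already grown to.
        List<byte[]> buffers = new ArrayList<>(frameCount);
        for (int i = 0; i < frameCount; i++) {
            buffers.add(new byte[frameSize]);
        }
        return buffers;
    }
}
```

If the warm-up array itself cannot be allocated, the OutOfMemoryError shows up immediately rather than somewhere in the middle of the loop, which matches the "OOM happens faster" observation.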
Much better, but can we do better?
The rest of the improvements we made were around using a service that keeps classes loaded and reusing a byte buffer queue when the user restarts recording, so that we don't have to allocate more buffers. These eventually brought the OOMs down to a very, very small number and allocation times to about 1.5s. The details are not important; what's important is that there was so much room for improvement, often in places where we did not expect a change to make a huge impact. Timing the execution and using GMAT-like tools were very important at first for identifying the bottlenecks.