Pagination - Build APIs You Won't Hate: Everyone and their dog wants an API, so you should probably learn how to build them (2014)

10. Pagination

10.1 Introduction

Pagination is one of those words that means something very specific to many developers, but it generally means:

the sequence of numbers assigned to pages in a book or periodical.

There are a few ways to achieve pagination, but when talking in terms of an API it means:

any way you want to go about splitting up your data into multiple HTTP requests, for the sake of limiting HTTP Response size

There are a few reasons for doing this:

1. Downloading more stuff takes longer

2. Your database might not be happy about trying to return 100,000 records in one go

3. Presentation logic iterating over 100,000 records is also no fun

As you can probably tell, 100,000 is an arbitrary number. An API could have endpoints like /places with over 1 million records, or /checkins, which could grow without limit. Many people forget about this while developing an API: 10 or 100 records will display quite quickly, but infinity is considerably slower. Data grows.

A good API will allow the client to request the number of items it would like returned per HTTP request. Some developers try to be smart and use custom HTTP headers for this, but this is literally what the query string is for.

/places?number=12

Some use number, limit, per_page or whatever. I always think limit only really makes sense because SQL users are used to it and REST is not SQL, so I personally use number.

warning

Define a Maximum

When you take the limit/number parameter from the client, you absolutely have to set an upper bound on it and make sure it is greater than 0. Depending on the data source, you might also want to make sure it is an integer, as decimal places could have some interesting effects.
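A minimal sketch of that validation, assuming a hypothetical helper called sanitize_number(). The default of 20 and the ceiling of 100 are illustrative choices, not values taken from any framework:

```php
<?php

/**
 * Turn the raw "number" query string value into a safe page size.
 * The default (20) and maximum (100) are assumptions for this sketch.
 */
function sanitize_number($raw, int $default = 20, int $max = 100): int
{
    // filter_var() returns false for anything that is not a whole number
    // of at least 1, so "12.5", "abc", "0" and "-3" all fall back to the default.
    $number = filter_var($raw, FILTER_VALIDATE_INT, ['options' => ['min_range' => 1]]);

    if ($number === false) {
        return $default;
    }

    // Cap oversized requests instead of rejecting them outright.
    return min($number, $max);
}
```

It could be called as sanitize_number($_GET['number'] ?? null). Whether to silently cap a huge value or respond with an error is a design choice; capping is simply the more forgiving of the two.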

10.2 Paginators

I stole the word “Paginator” from Laravel, which uses a Paginator class for a very specific type of pagination. It is not the most efficient form of pagination by any means, but it is rather easy to understand and works fine on relatively small data sets.

How do Paginators Work

One approach to pagination is to count how many records there are for a specific item. So, if we count how many places there are, there will probably be some sort of SQL query like this:

SELECT COUNT(*) AS `total` FROM `places`;

When the answer to that query comes back as 1000, the following code will be executed:

<?php
$total = count_all_the_places();
$page = isset($_GET['page']) ? (int) $_GET['page'] : 1;
$per_page = isset($_GET['number']) ? (int) $_GET['number'] : 20;
$page_count = ceil($total / $per_page);

With that basic math taken care of we now know how many pages there are in total, and have rounded it up with ceil(). That is the PHP equivalent of JavaScript's Math.ceil(), which rounds a number up to the nearest integer. If $total is 1000 and $per_page is 12, then $total / $per_page is 83.333. Obviously nobody wants to go to page 83.333, so ceil() rounds that up and $page_count is 84.

Using these variables, an API can output some simple meta-data that goes next to the main data namespace:

{
    "data": [
        ...
    ],
    "pagination": {
        "total": 1000,
        "count": 12,
        "per_page": 12,
        "current_page": 1,
        "total_pages": 84,
        "next_url": "https://api.example.com/places?page=2&number=12"
    }
}

The names of items in this pagination example are purely based off what Kapture’s iPhone developer suggested at the time, but should portray the intent.

You basically give the client enough information to do math itself if that is something it wants to do, or you let them ingest basic HTTP links too.
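As a rough sketch, that meta-data could be assembled by a small function like the one below. The name pagination_meta() and the $base_url parameter are assumptions made for illustration, and $per_page is assumed to have been validated as a positive integer already:

```php
<?php

/**
 * Build the "pagination" block for one page of results.
 * Field names mirror the JSON example; $base_url is a hypothetical parameter.
 */
function pagination_meta(int $total, int $count, int $per_page, int $page, string $base_url): array
{
    // ceil() returns a float, so cast it back to an integer page count.
    $total_pages = (int) ceil($total / $per_page);

    $meta = [
        'total'        => $total,
        'count'        => $count,
        'per_page'     => $per_page,
        'current_page' => $page,
        'total_pages'  => $total_pages,
    ];

    // Only advertise a next page when there is one.
    if ($page < $total_pages) {
        $meta['next_url'] = $base_url . '?' . http_build_query([
            'page'   => $page + 1,
            'number' => $per_page,
        ]);
    }

    return $meta;
}
```

Leaving next_url out on the last page is one option; returning it as null would work just as well, as long as clients know which to expect.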

Counting lots of Data is Hard

The main trouble with this method is the SELECT count(*) that is required to find out the total, which can be a very expensive request.

The first thing to come to mind will be caching. Sure, you can cache the count, or even pre-populate it. In many cases you certainly could, but you have to consider that most endpoints will have multiple query string parameters to customise the data returned.

/places?merchant=X

That means you will now have a separate cache entry for every count of places by each specific merchant. That too could be cached or pre-populated, but when it comes to geo data you have no chance:

/places?lat=42.2345&lon=1.234

Unfortunately, it is unlikely that multiple people will request the exact same set of coordinates regularly enough to make a cache worthwhile, especially as those coordinates point to a remote, mountainous region of Spain.

Pre-population for those results also seems highly unlikely. If you have literally millions of places then trying to count all places for somebody in Spain is just silly. Indexes can help. Slicing your data into geographic buckets and splicing it back together with some clever trickery can help. Generally speaking though, using this sort of pagination introduces big-data problems to what can be potentially small-data setups, especially when you have filtering options.

This is not bad (and I have used it myself for plenty of APIs) but you definitely need to keep this sort of thing in mind.
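To see why filters make cached counts less useful, here is a small sketch of how a cache key might be derived from the query string. The function name count_cache_key() and the "places:count:" prefix are made up for this example:

```php
<?php

/**
 * Derive a cache key for a COUNT query from the filters applied to it.
 * The "places:count:" prefix is an assumption for this sketch.
 */
function count_cache_key(array $filters): string
{
    // Sort by parameter name so ?a=1&b=2 and ?b=2&a=1 share one cache entry.
    ksort($filters);

    return 'places:count:' . sha1(http_build_query($filters));
}
```

Every distinct combination of filters produces its own key. That is manageable for a few thousand merchants, but each slightly different lat/lon pair is a brand new key, so the cache would almost never be hit.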

Moving Goal Posts

Another tricky issue with the “count everything then pick which page number” approach is that if a new item is added between HTTP requests, the same content can show up twice.

Imagine the scenario, where the number per page is set to 2, places are ordered by name, and the values are hip bars in Brooklyn, NY:

· Page 1

o Barcade

o Pickle Shack

· Page 2

o Videology

If the client requests Page 1, then they will see the first two results. While the results for Page 1 are being displayed to the end user, some hip new bar opens up with the name “Lucky Dog” and joins the platform.

Now the data set looks like this:

· Page 1

o Barcade

o Lucky Dog

· Page 2

o Pickle Shack

o Videology

If the client does not refresh Page 1 (which most would not do for the sake of speed) then “Pickle Shack” is going to show up twice, and “Lucky Dog” will not be on the list at all.
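The scenario can be reproduced with a plain array standing in for the database. The page_of() helper is hypothetical and only exists for this illustration:

```php
<?php

/**
 * Return one page of an alphabetically sorted list of names.
 */
function page_of(array $names, int $page, int $per_page): array
{
    sort($names);

    return array_slice($names, ($page - 1) * $per_page, $per_page);
}

$bars = ['Barcade', 'Pickle Shack', 'Videology'];

// The client fetches page 1 before the new bar joins.
$first = page_of($bars, 1, 2);

// "Lucky Dog" is added between the two requests.
$bars[] = 'Lucky Dog';

// The client then fetches page 2 against the changed data set.
$second = page_of($bars, 2, 2);

// Everything the client has been shown, in order:
// Barcade, Pickle Shack, Pickle Shack, Videology
$seen = array_merge($first, $second);
```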

Using Paginators with Fractal

This is a rather specific example, requiring Laravel’s Eloquent and Pagination packages, and Fractal. If you are not using any of those things then you can skip it and just use some simple math like the example JSON above. Otherwise, follow on:

<?php
use Acme\Model\Place;
use Acme\Transformer\PlaceTransformer;
use League\Fractal\Resource\Collection;
use League\Fractal\Pagination\IlluminatePaginatorAdapter;

$paginator = Place::findNearbyPlaces($lat, $lon)->paginate();
$places = $paginator->getCollection();

$resource = new Collection($places, new PlaceTransformer);
$resource->setPaginator(new IlluminatePaginatorAdapter($paginator));

10.3 Offsets and Cursors

Another common pagination method is to use “cursors” (sometimes called “markers”). A cursor is usually a unique identifier, or an offset, so that the API can just request “more” data.

If there is more data to be found, the API will return that data. If there is not more data, then either an error (404) or an empty collection will be returned.

Empty is not Missing

I personally advise against a 404 because the URL is not technically wrong, there is simply no data to be returned in the collection so an empty collection makes more sense.

To try the same example:

{
    "data": [
        ...
    ],
    "pagination": {
        "cursors": {
            "after": 12,
            "next_url": "https://api.example.com/places?cursor=12&number=12"
        }
    }
}

This JSON has been returned after requesting the first 12 records. Records 1-12 were all available and (for the sake of example) all have auto-increment integer IDs. So if we would like the content that comes after 12, the records with IDs 13 to 24 would be on the next page.

While this provides an incredibly simplistic explanation, generally speaking using IDs is a tricky idea. A specific record can move from one category to another, or could be deactivated, or all sorts of things. You can use IDs, but it is generally considered best practise to use an offset instead.

Using an offset is simple. Regardless of whether your IDs are integers, hashes, etc., you simply put 12 in there and say “I would like 12 records, with an offset of 12”, instead of saying “I would like records after id=12”.
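A minimal sketch of offset-based paging, with an in-memory array standing in for the data source. The offset_page() function is hypothetical, and the response shape simply mirrors the JSON example:

```php
<?php

/**
 * Return $number records starting at $offset, plus the offset for the next request.
 * An in-memory array stands in for the database in this sketch.
 */
function offset_page(array $records, int $offset, int $number): array
{
    // array_slice() returns an empty array when the offset is past the end.
    $data = array_slice($records, $offset, $number);

    return [
        'data'       => $data,
        'pagination' => [
            'cursors' => [
                // The next request starts where this one stopped.
                'after' => $offset + count($data),
            ],
        ],
    ];
}
```

Asking for an offset beyond the last record returns an empty data array rather than an error, which matches the advice to treat "no more data" as an empty collection instead of a 404.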

Obscuring Cursors

Facebook sometimes use cursors to obscure actual IDs, but sometimes use them for “cursor-based offsets”. Regardless of what the cursor actually is your user should never really care, so obfuscating it seems like a good idea.

Facebook Graph API using Cursors

How did Facebook get "MTA=" and "MQ==" as values? Well, they are intentionally odd looking as you are not meant to know what they are. A cursor is an opaque value which you can pass to the pagination system to get more information, so it could be 1, 6, 10, 120332435 or “Tuesday” and it wouldn’t matter.

Don Gilbert let me know that in the example of Facebook they just Base64 encode their cursors:

php > var_dump(base64_decode('MTA='));
string(2) "10"

php > var_dump(base64_decode('MQ=='));
string(1) "1"

Obfuscating the values is not done for security, but - I assume - to avoid people trying to do math on the values. Ignorance is bliss in this scenario, as somebody doing maths on an offset-based paginated result might end up doing the same math on a primary key integer. If everything is an opaque cursor or marker then nobody can do that.
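Here is a small sketch of wrapping and unwrapping an offset as an opaque cursor. The function names encode_cursor() and decode_cursor() are assumptions, and falling back to 0 for malformed input is just one possible policy:

```php
<?php

/**
 * Wrap an integer offset in an opaque cursor string.
 */
function encode_cursor(int $offset): string
{
    return base64_encode((string) $offset);
}

/**
 * Unwrap a cursor sent by the client; anything malformed falls back to 0.
 */
function decode_cursor(?string $cursor): int
{
    if ($cursor === null) {
        return 0;
    }

    // Strict mode returns false for characters outside the Base64 alphabet.
    $decoded = base64_decode($cursor, true);

    // ctype_digit() only accepts a non-empty string made up of digits.
    if ($decoded === false || !ctype_digit($decoded)) {
        return 0;
    }

    return (int) $decoded;
}
```

Because the client only ever passes the string back, the server is free to change what sits inside the cursor later without breaking anybody.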

Extra Requests = Sadness

This approach is not favored by some client developers as they do not like the idea of having to make extra HTTP requests to find out that there is no data, but this just seems like the only realistic way to achieve a performant pagination system for large data. Even with a “pages” system, if there is only 1 record on the last page and that record (or any other in any page) is removed then the last page will be empty anyway, so… every pagination system needs to respond to an empty collection.

Using Cursors with Fractal

Again this is a rather specific example, but should portray the concept.

<?php
use Acme\Model\Place;
use Acme\Transformer\PlaceTransformer;
use League\Fractal\Cursor\Cursor;
use League\Fractal\Resource\Collection;

$current = isset($_GET['cursor']) ? (int) base64_decode($_GET['cursor']) : 0;
$per_page = isset($_GET['number']) ? (int) $_GET['number'] : 20;

$places = Place::findNearbyPlaces($lat, $lon)
    ->limit($per_page)
    ->skip($current)
    ->get();

$next = base64_encode((string) ($current + $per_page));

$cursor = new Cursor($current, $next, $places->count());

$resource = new Collection($places, new PlaceTransformer);
$resource->setCursor($cursor);

This will take the current cursor, decode it, use it as an offset, then work out the next offset and Base64 encode it as the next cursor. There is a bit of work to do in this example because the Cursor class is intentionally vague. Instead of an offset, the cursor could be a specific ID used in an SQL WHERE id > X clause, but for the reasons given earlier that is best avoided.