Counting From Zero
I often hear this phrase: “programmers are counting from zero”. Not so long ago I actually decided to check if it is true and asked some programmers I know to count to ten out loud.
None of them counted from zero.
Well, this phrase is usually brought up when discussing various programming languages, which share the common idiom - zero-based array indexing. In this context, yeah, many programmers have to count from zero, except I’ve never heard someone refer to the first element of an array as the “zeroth element”. So, even in languages that are zero-indexed people still refer to array elements by their natural indexes, and only specify zero-based indexes when they refer to an actual code that accesses the array element. Yet, zero-based indexing is used in a lot of programming languages, and many programmers think that this is the case for all languages, mainly because it was the same for all languages they’ve tried, but it’s actually not the case. Though such languages are rare, so this assumption can be understood.
As you may know from my other posts, I often program in the Fennel language, which then compiles itself to Lua.
And Lua is one of these languages, which starts counting from one.
This throws off many programmers, and I often see in the #fennel
IRC channel messages, asking if Fennel fixes this design misfeature of Lua.
Fennel does fix a lot of design misfeatures of Lua, but one-based indexing is not one of them.
Kinda, I’ll talk about it a bit later.
But first, let’s figure out why some low-level languages use zero-based indexing.
Arrays and memory
Languages like C are quite efficient with memory, and also kinda unsafe because of that.
For instance, arrays in C are just pointers to a specific memory location, and the very start of the array matches its first element.
So the index of the array in C is actually a memory offset, and zero offset matches the start of the block of memory.
Then, to have very quick access to array elements, the index is added to the address of the array, and we get a new memory address that is just dereferenced for value access.
Which is very fast.
The arr[10]
syntax is nothing but a sugar over *(arr+10)
, as demonstrated by this code:
#include <stdio.h>
int main() {
int arr[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
printf("arr[4]: %d\n", arr[4]); // prints arr[4]: 4
printf("*(arr+4): %d\n", *(arr + 4)); // prints *(arr+4): 4
return 0;
}
C actually increments addresses differently, so +
on a pointer works not the same as plus on the integer, but that’s beside the point.
The point is, that arrays in C are contiguous segments of memory, which programmers took advantage of for a long time.
And a lot of languages were influenced by C, due to its popularity, or the fact that these languages are simply implemented in C.
But high-level languages may, or may not use the same memory layout for their arrays.
The key takeaway here is that there are two fundamentally different approaches to data structures - ones that are memory mapped, and others that aren’t. If your language has memory-mapped data structures, like arrays in C, then yes, zero-based indexing is probably what you want, but it’s not mandatory. I’m saying it’s not mandatory because it really doesn’t matter that much, a compiler may convert indexes to offset mapping 1 to 0 when accessing memory-mapped data, so you still can choose to use one-based indexing if you’re developing such a language. Especially, if you have non-memory-mapped data structures, that are sequential and can be indexed, like Clojure vectors that are based on persistent hash-array-mapped-tries. Thus, I’d say that zero-based indexing is not a requirement but a tradition, that most programmers don’t think about.
Why Lua tables use one-based indexing
Most languages, despite being high level or low level, use zero-based indexing, even if it doesn’t make sense in the context of the actual implementation of the data structure. In Lua, however, there are no arrays, only tables, and until a certain point tables were not storing data in one contiguous block for integer keys. And AFAIK it was one-based because the developers thought that this is easier to understand for non-programmers.
Table implementation has changed since, and includes array part storing sequential integer keys, mainly for optimization purposes, but because tables are not arrays the indexing was kept one-based. And because of a combined nature, Lua’s tables can be indexed from any integer key:
local t = {[-1]="a", [0]="b", "c", "d"}
print(t[-1]) -- a
print(t[1]) -- c
This only proves the point that in a high-level language you can index things however you want if the data structure supports it.
In the case of this table, the hash part stores keys -1
and 0
, and the array part stores the rest of the values in a contiguous block of memory.
This may seem like a huge foot-gun because you never know what keys are in the table, and how to iterate over it. If you want to iterate, as you would in C, then yes, the problem is real. Lua does support numerical loops, so if you know the minimal key in your table, and table length, you can do it like so:
local t = {[-1]="a", [0]="b", "c", "d"}
for i=-1,#t do
print(t[i])
end
-- a
-- b
-- c
-- d
As a small benefit, table length, obtained with #
operator here, always matches1 the last integer key.
E.g. the length of this table {"a", "b", "c"}
is 3
, so you can just iterate over it by indexes from 1
to 3
inclusively, not to #t-1
, as you would have to do in zero-indexed language.
But the thing is - you likely won’t be doing such loops, because these loops only make sense in the context of partial iteration, and tables in Lua almost never iterated partially.
And this actually applies to many other high-level languages, that have functions like map(f, coll)
and filter(f, coll)
- if you want to partially iterate an array you’ll likely filter
it first.
Lua doesn’t have such functions, but it still has a generic way to iterate over tables without thinking about indexes, as it provides two functions that return iterators over the table: pairs
and ipairs
:
for k,v in pairs {a=1, b=2, c=3, "foo", "bar"} do
print(k, v)
end
-- 1 foo
-- 2 bar
-- c 3
-- a 1
-- b 2
for i,v in ipairs {foo="bar", "a", "b", "c"} do
print(i, v)
end
-- 1 a
-- 2 b
-- 3 c
These functions are fundamental ways to traverse tables in Lua, and the reason one-based indexing is not a huge problem. It is unlikely that you’ll be using table indexing with numerical indexes, because data that has to be taken out usually comes in associative tables, and the array part usually is traversed as a whole. Still, if you do want to index sequential tables, you’ll have to remember that indexing starts from 1 in Lua. Here’s a function in Lua that sums the first three values of a table:
local function sum_three(t)
local a, b, c = t[1], t[2], t[3]
return a + b + c
end
sum_three({4, 5, 6, 7}) -- 15
A programmer not familiar with Lua indexed might make a mistake here, and write local a, b, c = t[0], t[1], t[2]
, thus getting nil + 4 + 5
as a result.
Fennel, on the other hand, while not changing the indexing of the table, actually helps you not to make mistakes by providing a feature, called destructuring.
It avoids this problem by not using indexes at all - here’s the same function in Fennel:
(fn sum-three [t]
(local [a b c] t)
(+ a b c))
(sum-three [4 5 6 7]) ; 15
The (local [a b c] t)
is a destructuring syntax, which is also known as pattern matching in languages like Erlang or Elixir, except it doesn’t do actual pattern matching here, only pattern binding.
What it essentially does is, it looks at the shape of provided data, [a b c]
in this case, and understands that in this case t
should be treated as an array, where each of the names are bound to the respecting elements from that array.
It is compiled to the same kind of code, as we’ve manually written in Lua before:
local function sum_three(t)
local a = t[1]
local b = t[2]
local c = t[3]
return (a + b + c)
end
Not as pretty as handwritten Lua code, but saves you the trouble of knowing what indexes to use, as now it is an internal thing. This syntax is supported everywhere in Fennel where names are bound, for example in the function argument list, so the function can be rewritten more idiomatically like so:
(fn sum-three [[a b c]]
(+ a b c))
The compiled Lua is again essentially the same as the one from the above.
You might wonder how to skip some elements in the array then?
You simply ignore those by using a _
as a name in the binding syntax:
(fn sum-some [[_ _ _ a b _ c]]
(+ a b c))
(sum-some [1 2 3 4 5 6 7]) ; (+ 4 5 7)
Yes, if you need to get the 489th element of the array, this won’t really scale, and at this point, you might want to use indexes, but I doubt that this will happen too often for you.
At least I’ve never needed to do that, or even to skip elements with _
like above, for that matter.
The fun part is that Fennel shares its syntax with Clojure to some extent, and Clojure, being a JVM-hosted language, uses zero-based indexing for its arrays. And thanks to the fact that destructuring is supported by both languages, I never get confused about what indexing to use when I switch between the two languages very frequently.
;; Clojure
(defn sum-three [[a b c]]
(+ a b c))
(sum-three [1 2 3])
;; Fennel
(fn sum-three [[a b c]]
(+ a b c))
(sum-three [1 2 3])
So while Fennel didn’t fix the indexing, it surely made it less of a problem (especially for those coming from Clojure).
So what was this all about
What I wanted to say by this post is that zero-based indexing is not really something that is often thought about too much by programmers nowadays. We accept that most languages use it, even if it doesn’t make sense in the context of the data structures it uses, and new languages usually stick to the same approach to avoid confusion. And, I also think, that if a language steps out of the line and tries to use one-based indexing, it may have a much harder adoption, due to programmers freaking out and making mistakes because of this choice.
Personally, if I were to develop a language for myself that doesn’t use memory-mapped data structures, I’d go for one-based indexing, because it makes more sense to me. And if I later added C-like arrays, I’d still keep the 1-based indexing for them, because it’s just the compiler’s internal knowledge of how the data structure is implemented. And one small benefit of doing iteration over indexes and not subtracting one from the array length is also kinda nice :)
Though, if I wanted my language to be more widely adopted I’d consider using zero-based indexing.
-
It’s not always the case if the table has gaps, but that’s a different story. ↩︎