Data Structures and Algorithms


Loading ....
 

Concepts of data structures: tables and linked lists

Now that you have your first notions about algorithmic complexity, it’s time to introduce the concept of data structure that you were made to glimpse in the introduction. Just like the first part, we will not do anything complicated at the moment, but the bases presented here will be useful for the continuation of the course.

We will focus on two extremely common structures: lists (simply linked) and tables .

Definition

The basic principle of a data structure is to store elements that the programmer wants to access later. The different uses of the data structure of operations are called .

For example, a common operation is reading : we want to retrieve an element stored in the structure. There are many other operations, such as inserting , adding a new element to the data structure, or deleting it , removing it.

All data structures do not allow the same operations, and above all they do not always have the same cost . For example, on some structures, it is very fast to add an element, in others it is difficult and it may require a complete reorganization. The cost of data structures can be measured quite finely, but what interests us in this course is the complexity : for each data structure used, we will try to have a good idea of ​​the complexity of the operations that the we perform.

This knowledge of costs has two advantages: when we are given an algorithm using a particular data structure, we need to know the cost (complexity) of the operations performed to evaluate the overall complexity of the algorithm. But most importantly, and this is probably the most interesting aspect, when we have an idea of ​​the operations that we need for an algorithm, we can choose the most appropriate data structure (the one for which these operations are the least expensive).

In practice, most people use fairly simple algorithms (which are not based on sophisticated manipulations), where the only choice of the right data structure can make a difference in performance. Knowing your data structures and knowing how to make a choice plays a very important role for the programmer.

Paintings

The table is probably the most common data structure, at least in derived languages ​​or inspired by the C language .

The principle of a table is very simple: we store the elements in boxes, each box being labeled with a number (usually called index ). To access a particular element of a table, give its index.

The indices are consecutive integers, and we will consider that they start at 0, as in most programming languages. The first element is therefore index 0, the second index 1, etc. (watch out for the shift) In particular, if nthe size of the array is the last element is at the index n-1. Requesting access to an index that does not exist causes an error.

Data Structures and Algorithms

We consider that the size of a table is always known (the programmer had to know it when he asked for the creation of the table, and then it is enough to keep it).

The time of creation of the array depends on the languages. In general, the creation function puts in each box a default value, and its execution time is then proportional to the length of the array (so in complexity O (N) if N is the size of the array). However, it is possible in some languages ​​to create “uninitialized” tables (with unknown values ​​for each box) more quickly. It is a practice that can be dangerous because sometimes we have values ​​that make no sense in our table. We will consider here that all the tables are initialized as soon as they are created, and therefore that it is always in O (N).

lists

The list is an extremely used data structure. In particular, it played a major role in the Lisp language and remains very present in the many languages ​​that have inspired it.

Note: To correctly describe a list, I am forced to move slightly away from pure algorithmic considerations, to detail a little more precisely how programming languages ​​manage data structures. I will use here an independent description of the programming language: we will speak of cells having one or more fields . A cell is simply a set of boxes that can store data, and which is accessed by name (which is called field name , such as form fields for example). For example, we can describe a structure that stores a person’s address and number as a three-field cell nom.adresseand téléphone.

Depending on your language, the cells will have different shapes: structin C, objects in Java, etc. : It’s up to you to choose the translation you like most. We will not go into low level details about the memory organization either, but it is good to have a common model to discuss.

It will be further considered that the creation (and destruction) of a cell (if it has a fixed number of fields), as well as the reading or the modification of a field, is done in constant time.

Let’s come to the definition of a list. A list is:

  • either the empty list;

  • a two-field cell, a field têtecontaining an element, and a field queuecontaining the address of another list.

In other words, a list is either empty or an element followed by a list . This definition is fun because it is recursive : it uses the word “list”. The definition of a list uses the definition of a list! But, just as recursive programs do not all loop infinitely, this definition is not a vicious circle, and it is quite correct (you’ll see it in use).

A list can be the empty list (0 elements), or an element followed by the empty list (1 element), or an element followed by an element followed by the empty list (2 elements), and so on.

a list of 3 elements

We will say that the element in the field têteis the head of the list, and that the list in the field queueis its tail. The queue of a list contains all the elements of the list, except the first one. For example, the tail of the list [1; 2; 3]is [2; 3]. Of course, the empty list has neither tail nor head: trying to access one of these fields causes an error.

There are actually many variations of this structure. I have described here the simplest, which is also called simply linked list because cells contain only, in addition to one element, an arrow to the next (we imagine that these arrows form a chain). Other structures are useful in particular cases (for example we can put two arrows per cell, one to the next element, and one to the previous element, it is the principle of the doubly linked list ), but that This is the most common and the most useful.

Implementation : the representation of chained lists depends very much on programming languages. In functional languages, this type is already defined (for example in Caml one notes []the empty list and tete::queuethe cell whose head is teteand the tail.In queueC, one must build it oneself.We use a very classical representation: a structure for the cells, and the pointer NULLfor the empty list:

struct list
{
int val;
struct list *next;
};
typedef struct list List;

The lists are then represented as pointers to a cell:

List *une_liste;

As in C, we must allocate and free the memory by hand, we will also need a function that frees all the cells of a list:

void free_list(List *list)
{
while (list != NULL) { /* tant que la liste n’est pas vide */
List *cell = list;
list = list->next;
free(cell);
}
}

At each stage of the loop while, we store the head of the list in a variable cell, we advance in the list ( listbecomes the tail of the list), and we release cell. We have to use this intermediary variable, because if we started with free(list), then it list->nextwould be meaningless (since listit was erased) and we could not move past the list.

Add / remove, size, access to an item

Add / Remove

What is the easiest way to add or remove an item to a table or list?

For a list, these two operations are very simple as we consider only an addition (or a deletion) at the top of the list: to delete the head element, simply replace the list by its tail (which is the list of all the following items). To add an element at the top, just create a new cell, put this element in the field tête, and the start list in the field queue.

add an item to a list of 2 items

These two operations are done in constant time (the first is to read a field, and the second to create a cell), so their complexity is in O (1).

Note: Adding to the top of the list (that is, creating a new list from an item and an old list) is fundamental to list manipulation. It has a specific name, cons(read “consse”), which even gave rise to a verb (used only in computer science) in English, to cons . It is fundamental because in a way it is part of the definition of lists, which can be rephrased as either an empty list or a list cons.

Implementation: In languages ​​where lists already exist, it is extremely simple to define cons. For example in Caml:

let cons tete queue = tete::queue

Otherwise, you must use the type that you have defined yourself. In C, we must also take care of the memory allocation:

List *cons(int valeur, List *liste)
{
List *elem = malloc(sizeof(List));
if (NULL == elem)
exit(EXIT_FAILURE);
elem->val = valeur;
elem->next = liste;
return elem;
}

For tables, the question is more delicate. The size of an array is fixed in advance, it is not possible to add elements (simply because there is not necessarily any available space in the memory, on the edges of the table, for to be able to enlarge it). The safe way to add one (or more) elements is to create a bigger table somewhere else, which contains enough room for all the old elements and the new one (s), and to copy the old elements into the new table, before adding the new ones. This method requires the creation of an array of size N + 1, then a copy of each element of the array, so it is in O (N) (where N is the size of the array before insertion), or linear. Similarly, the size of a table being fixed in advance,

Note: In some languages, it is possible to try to resize the tables on the spot in some cases, or to eliminate elements that are at the beginning or end of the table. This is still rather risky, and we will not consider these operations.

Cut

When it comes to calculating the size of the data structure, it is the table that has the beautiful role. Indeed, we consider that the size of an array is always known, so there are no calculations to do to obtain it: it is an operation in O (1).

For a list, we do not usually know the size of a list (especially if we have just added or removed many items at the top of this list). To calculate the size of a list, we apply the following algorithm:

  • if it is the empty list, its size is 0;

  • otherwise, we calculate the size of its tail, and we add 1.

Thus, we will go through the list until we reach the empty list, adding 1 for each element. This method works very well, but requires a complete path of the list, so is in O (N) (where N is the size of the list).

Remark: as for the tables, it would be possible to store the size of the lists in the structure itself, instead of having to calculate it each time: in addition to having têteand queue, one would add to each cell a field taillewhich would contain the size of the list. The problem of this method is that the operation consbecomes more expensive: when we create a new cell for the element to add, we must put the new element and the tail as before, but then we must access the first cell of the queue, to read the size N of the old list, to be able to put N + 1 in the fieldtailleof the new cell. It just adds a step (more precisely, two cell reads, an addition and a field initialization in addition), so the operation remains in O (1), but it still slows down the operation substantially, which is embarrassing when you use a lot cons. In practice, most people use a lot cons, and very rarely need the size of the list; this “optimization” is not interesting because it would slow down the program. Once again, we find the central idea, which is to choose its data structures according to the use we want to make, so that the most common operations are as fast as possible.

Access to an element

How to do if we want to recover for example the fifth element of our collection (list or table)? For a table, it’s simple: we ask the element of index 4 (pay attention to the shift), and we get it immediately. This operation is in O (1).

For a list, it is more difficult: when we have a list, we have direct access to the first cell, so we only know his head, and his tail; we can give quickly only the first element. But in fact, we can also have access to the second: it is the head of the queue of the list :Wizard:! And third: the head of the queue tail of the list. In fact, we look for the head of the tail of the tail of the tail of the tail of the list. Too easy.

Here is an algorithm to retrieve the index element from na list:

  • if n = 0(we ask the first element), return the element that is in the field tête;

  • otherwise, return the element that is in the index n-1to the list that is in the field queue.

You can notice that we directly consider our list as a cell: if the list is empty, we can not retrieve any element, so it is an error.

To access an item, you must go through the entire list to the desired position. To access the index element kyou have to do about koperations. What is the complexity of the operation? As explained in the first part, one has to be pessimistic and consider the complexity in the worst case: in the worst case, one looks for the last element of the list, so one has to go through it all. The operation is therefore linear, in O (N).

You have probably noticed the big difference between the problem of access to the firstelement, and access to “any” element. In a list, the first operation is in O (1) ( :Sun:) and the second in O (N) ( : '().

To differentiate them, computer scientists have a specific term to say “access to any element”: they speak of arbitrary access . Many data structures can access certain privileged elements very quickly, but are slower for arbitrary access. Tables have the property of having arbitrary access in constant time, which is rare and very useful in some cases.

Note: You may not know the term “arbitrary access”, but you may have already encountered its English equivalent, random access . Or, you never wondered, while fiddling with the RAM of your computer, what meant RAM: Random Access Memory, random access memory.

The problem of accessing a list is not limited to reading the element at a given position: you might also want to add or remove an element at this position. These algorithms are close to that of reading, and they too have a linear complexity.

A little anecdote the change made that was accepted by the author of the video game in question.to illustrate the importance of studying complexity: when we do not work on this tutorial, we sometimes play. Among these games, one of them had a loading time of 90 seconds when it was necessary to generate a new map of the world. A little surprised, and given that the source code of the game was available, we studied the faulty functionality. The game spent 88 seconds randomly accessing the items in a list! By transforming this list into a simple array, the loading has become almost instantaneous. : Ninja:The most curious can go to study

Concatenation, filtering

Concatenation

Let’s imagine that we have two groups of elements, stored in two different lists (or two tables), and that we want to bring them together. We want to build a structure that is in a way the “sum” of the two starting structures. This operation is called “concatenation” (it comes from Latin to “chain together”).

For paintings, it’s quite easy: if the first painting is A, and the second B, and we note L the size of A and L ‘(read “L prime”) the size of B, we create an array of size L + L ‘, where we copy all the elements of A, then all the elements of B. This requires L + L copies (and the creation of L + L’ boxes): the operation is in O (L + L ‘).

concatenation of tables

Note: here I gave the complexity according to two variables, L and L ‘. I had previously defined complexity as dependent on a single variable, but it is a special case. The complexity of an algorithm can depend on all the parameters on which the algorithm depends, and not just one. Moreover, complexity makes sense only when the variables we use to express it are well defined: saying O (N3) is not enough, we must make sure that everyone understands what the variable N (although in general, it is obvious and left implicit).

For a list, the situation is a little different: as we can easily add an item at the top of the list, we can also add a sequence of elements. So just add all the elements of A at the top of list B. This amounts to making a copy of A before B. You can already guess (the algorithm will be specified later) that as we add L (the size of A) elements at the top of B, the complexity will be in O (L).

concatenation of lists

Here is a more detailed algorithm that concatenates A and B:

  • if it is empty, we have nothing to do: we return the second list, B;

  • if it’s a cell, we proceed in two stages:

    • we calculate the concatenation of its tail with B,

    • we add the head to this result;

    We can summarize this by cons(tete(A), concat(queue(A), B)).

concatenation of lists, detail

Again, this function is recursive, I invite you to check that it works well by implementing it yourself in your preferred language. What is its complexity? We will call the function concatonce on A, then on queue(A), then on queue(queue(A)), etc., until we reach the empty list. In other words, we will have called as concatmany times as there Aare elements. The rest of the operations (performed at each call of concat) is a cons(and the reading of the head), so in O (1). To make L (where L is the size of A) times an operation in O (1), ie L times a operation in constant time, puts a time proportional to L. It is in O (L) .

Note: with this algorithm, we copy (by cons) each cell in list A: list B is left unchanged, but we created L cells. You may have noticed that another way of doing things would be possible: we could take the last arrow of A (the one pointing to the empty list) directly, and modify it to point it to the first cell of B. This method has the advantage of not requesting copy of A’s cells, but also a major disadvantage: it modifies list A. If you had a variable that designated A before the operation, it now refers to concat(A, B). List A, in a way, was “destroyed”.

concatenation of lists, destructive algorithm

This behavior, which is called an edge effect , can give rise to bugs if you are not careful (for example, if you think you still have access to A, when in fact you are handling concat(A, B)). If you eliminate the negligence of the programmer (because you are sure that you , you do not make such mistakes – haha!), It can still pose tricky problems in the case of multi-threaded applications for example (a thread calculates the number of items in your list, but just before it uses it to do something, another thread silently modifies the list by adding a lot of items at the end; by your first thread is no longer valid: boom!).

Overall, the presented algorithm, which has the property of not modifying the starting lists A and B, is much safer and more convenient to use. There are other formulations, but in any case they all have the same complexity.

You may note that the concatenation of two lists does not depend on the size of the second list, which is kept identically, but only of the first one. For arrays, concatenation depends on both. This is a difference that can be very important if you want to concatenate very often small lists (or small tables) to a large list (or a large table). In the case of paintings, this operation will be very expensive since you will have to copy the big painting each time. In practice, it is therefore quite rare to concatenate tables, while the operation is more common for lists.

filtering

Here is a last operation that occurs regularly when you manipulate data: select a part of them. For example “among the people I know, I want the name of all those who speak German”. In computer science, we will represent the question “does he speak German or not?” by a function: it takes a person in parameter, and returns true(true) if it speaks German, false(false) otherwise. I call these functions “choice functions” (sometimes called predicates ). We want to perform a filtering operation: Given a collection (list or array) containing elements, and a function of choice on these elements, you want to recover only the elements for which the choice function returns true.

If we use tables (especially if we want the results of filtering to be stored in a table), we are faced with a problem: we do not know a priori what will be the number of elements to return. You do not know a priori, without thinking or asking them the question, how many of your acquaintances speak German. There are many possibilities. I’ll mention one for now (I’ll talk about a second then, and anyway if you thought of another, it’s fine).

The first possibility is to start from a size 0 (empty, what) table, enlarging it every time you find a new Germanist in your knowledge. As we have seen, enlarging a table usually requires as many copies as there are boxes. In the first person found, you will not make any copies (create a size 1 chart to put the person). In the second, you will make a copy (the first person found). In the third, you will make 2 recopies. In the end, if the filtered array has K elements, it will have been built by doing 0, then 1, then 2, …, then K-1 copying, ie 0 + 1 + 2 + … + (K-1 ) recopies in total, that is to say approximately K2 / 2 recopies. In the worst case, K equals N, the size of the starting array ( allyour knowledge speaks German!), and you have about N2 / 2 operations: this algorithm is in O (N2).

We can obtain an interesting algorithm for the lists by applying exactly this method, but by using lists instead of tables: we start with an empty list, and for each interesting element (that is to say which makes refer trueto the choice function), we add it to the top of the list (by one cons). Each consis in O (1), and in the end we make it to the maximum N. The algorithm using a list is therefore in O (N). It is quite striking that by using exactly the same algorithm, very different complexities can be achieved simply by changing the data structure. This illustrates the fact that the choice of structures is important, and that the programmer must be aware of them.

To save the honor of the tables, it is necessary to present another algorithm with a complexity less bad than O (N2). We can simply browse through our collection of elements to be filtered a first time to count the number of interesting elements, create an array of this size, then browse it a second time by adding the interesting elements in the table. We do two courses, but the number of operations remains proportional to N, so this algorithm is indeed in O (N). I have probably said many times that we would only be interested in complexity, but it is time to make an exception (because if we never did, it would not be funny): this algorithm requires two paths of the starting collection, so even if it has the same complexity as the algorithm using lists, it is much less interesting, and in particular it will generally be slower. It is also a little less resistant to various odd situations that could arise: if the starting table is changed between the two courses, this can be problematic; moreover, the function of choice will be called twice by element instead of one, which can be very annoying if it does odd things (for example if it stores the interesting elements by refusing the elements already met). These are problems that can be remedied, but all of this involves additional complications, and possibly performance degradation. if the starting table is changed between the two courses, this can be problematic; moreover, the function of choice will be called twice by element instead of one, which can be very annoying if it does odd things (for example if it stores the interesting elements by refusing the elements already met). These are problems that can be remedied, but all of this involves additional complications, and possibly performance degradation. if the starting table is changed between the two courses, this can be problematic; moreover, the function of choice will be called twice by element instead of one, which can be very annoying if it does odd things (for example if it stores the interesting elements by refusing the elements already met). These are problems that can be remedied, but all of this involves additional complications, and possibly performance degradation.

Synthesis

operations

surgery

board

listing

arbitrary access

O (1)

We)

adding

We)

O (1)

cut

O (1)

We)

concatenation

O (n + m)

We)

filtering

We) : Uh:

We)

An overview of these two data structures can be found: the list is a structure to which it is very easy to add or remove (for example filtering) elements, while the table is very efficient when the number of elements does not change and we want arbitrary access.

Depending on the situation, you will need to use one or the other instead. In general, it is good to use a list when you have no idea of ​​the exact number of items you will handle (for example, if you are filtering, or plan to add items regularly). On the other hand, you do not have arbitrary access: you can always save certain elements of the list in separate variables if you need them very often, but you can not pick up specific elements in the middle of the list directly. : the only method of access is the route of all the elements (or at least all the elements at the beginning of the list: you can stop the course along the way).

It can be difficult at first to know which data structure to choose in a particular case. Even if you have made a choice, stay tuned for the operations you are doing. For example, if you find yourself often asking for the size of a list, or conversely trying to concatenate tables frequently, it may be time to change your mind. Some languages ​​offer facilities for manipulating arrays, not for lists (which must be built by hand, for example in C): if you do not have a library for your convenience, choose the structure that is easy to use (in many cases it is possible to mimic what one would do naturally with a list by clumsily using a table).

conversions

Finally, we must know that the choice of data structures, it is not for the whole life. Data structures are just a way of storing information, and just as you may from time to time rearrange your office or your home, it’s possible to change your organization. that is, moving from one data structure to another, retaining the stored information.

Exercise: Write a function that converts a list into an array, and a function that converts an array into a list. Both functions must be in O (N).

Moving from one data structure to another can be a very good idea if your program goes through several well-separated phases that use very different operations.

For example, you can start your program by gathering information (many additions of elements, concatenations, filtering to eliminate bad elements, etc.), followed by a second half of the heavy processing program. information collected (with routes in all directions, many arbitrary access, all without adding or removing elements). In this case, it is quite natural to use a list initially, and to convert it at the beginning of the second phase into a table. You can combine the benefits of both structures for your program.

Obviously, the conversion has a cost, and is therefore only interesting if you expect to gain a lot of performance by doing it. No need to go from one structure to another without stopping, doing very little every time. Not all programs are cut into such distinct phases and the choices will sometimes be quite delicate. Again, it’s up to you to guess what’s best, and then test. Feel free to try several different configurations to see what works best.

You will see later in the course other structures, with different profiles, which may be useful for intermediate cases. In particular, there are hybrid structures that allow easy access to a particular element (a little less quickly than with a table, but much more than with a simple list), but also add or remove elements (a little less quickly than at the top of the list, but much more than with a simple table). However, these structures are usually more complicated than lists and tables.

Watch out for ugly languages :Zorro:

Once you have read this chapter, you have (perhaps) understood all the differences between arbitrary access to O (1) and linear access to O (N), and you are about to make a massacre in your favorite programming language: “I’ll be able to index two billion extra pages a day!”

Unfortunately, it happens that some languages ​​do not work exactly as I described here. Instead of having highly specialized data structures, such as lists and tables, they prefer data that behave “not too badly” on most common operations (or rather that language designers have found to be the most common ones). ), at the cost of being complicated and unknown data structures, and sometimes behaving much worse than you expected. The problem is that these surprise structures also have the annoying habit of taking innocent names as “lists” or “tables”.

For example, the “lists” of Python are actually strange tables, and the “tables” of PHP are not tables at all (you should have known it by noting that they propose operations “add at the beginning” , “add at the end”, “remove at the beginning”, etc.). Other languages, often so-called “scripting languages”, have such cavalier policies on data structures.

Race result, always be wary before using a data structure. Try to find out about the algorithmic complexity of the operations you are interested in to see if this is what you expect. On some algorithms, going from O (1) to O (N) for an operation can be very painful!

There is a method to identify “risk data structures”, that is, those that are not quite real tables or lists, but disguised hybrid data structures: their interface. We have seen, for example, that a table does not support element insertion very well: if the library of “tables” of your language proposes to insert and delete elements as if it were a natural operation, then this are probably not real paintings.

These “ugly” languages ​​(a more or less affectionate name for languages ​​that make the ‘simplicity’ of programming go before anything else, including the most elementary (algorithmic) efficiency management) sometimes make it possible to use the “true” data structures, in a specialized library. For example, Python lists offer arbitrary access, but it is better to use (possible) the “real” tables in the library array.

This is also something you will need to consider if you ever create your own library of data structures (yes, that happens!). If an operation is too expensive for your organization, you do not have to offer it to users (who might think they are encouraged to use it): a conversion function to a more traditional structure that effectively supports this operation will case. In general, it is important to clearly specify the expected complexity of the operations you offer to users.

 


Loading ....
 

Be the first to comment