Replacing duplicates in a randomly generated array

Hello all! I’ve been working on a small C program to help me study. The basic idea is that I can randomly generate a problem set from a given range of questions in a textbook to a size I choose, and have the program spit out that set in numerical order (yes, I do realize that there are much easier ways to do this. I chose to use this as an opportunity to practice some coding as well :smiley: ). The only issue I’ve had is that the rand function does spit out duplicate numbers when I run it, so I’m wondering if anyone has any suggestions on how to remove and replace duplicate numbers within an array.

If it helps I’ve also attached my code thus far. It’s nothing too complex, but then I’m also majoring in EE, not Comp Sci, so I haven’t had much occasion to learn anything more advanced

/* Generate random problem set for a given range

input minimum number
input max number
input set size
generate problems of number set size within range
output in numerical order */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
	int setSize, max, min; //inputs
	int set[30] = {0}; //starting array
	int range; //range for random generation
	int gen, i, j, k; //indexes
	int temp; //bubble sort temporary storage

	//collect inputs
	printf("Minimum problem number: ");
	scanf("%d", &min);
	printf("Maximum problem number: ");
	scanf("%d", &max);
	printf("Size of problem set: ");
	scanf("%d", &setSize);

	//calculate range for generation
	range = max - min;
	//seed random number generator
	srand(time(NULL));

	//generate problem set to given size 
	for (gen = 0; gen < setSize; gen++) {
		set[gen] = ((rand() % range) + min); //rand generates 0 to range, adding min compensates for bottom edge of problem range
	}

	//bubble sort from least to greatest
	for (i = 29; i > 0; i--) {
		for (j = 1; j <= i; j++) {
			if (set[j-1] > set[j]) {
				temp = set[j-1];
				set[j-1] = set[j];
				set[j] = temp;
			}
		}
	}

	//display problem set, excluding 0s
	for (k = 0; k <= 30; k++) {
		if (set[k] > 0) {
			printf("%d
", set[k]);
		}
	}
}

This is a good example of time complexity.

The obvious easy way to “eliminate duplicates” is to just check each number you’ve picked so far, and if it is equivalent to your current number, skip it!

However, if this was for very large numbers it would not be good because it would have O(n^2) (“Big-O of n squared”) time complexity. It would have to do a set of instructions for all n elements fo the array, but for each element, it is checking at least some part of the (n-1) other elements, creating a polynomial. We take the largest term in the polynomial and simplify it to say that “The number of instructions processed in the worst case will be on the order of n^2”, or O(n^2).

Thankfully, in your case, I doubt you will be processing hundreds, thousands, or millions of questions, so we don’t have to think any harder about better ways to do it. I don’t have any off the top of my head right now, so I don’t have one!

EDIT: I thought of one. You could set up a binary tree containing the questions you’ve selected so far, and finding whether or not a number has been picked would be O(log n) making the entire algorithm O(n log n). This is way overcomplicated for your efforts :slight_smile: I just love algorithms!

One way to represent a set is by using a bit array:

  1. Create an array of size (max - min + 1).
  2. Initialize each element of this bit array to 0.
  3. When you generate a random number, x, check the value of the element in the bit array at index [x - min].
  • If the bit array value is 0, set the value at [x-min] in the bit array to 1.

  • If the bit array value is 1, then x already exists in the set, so generate another random number.

  1. Repeat the process until you’ve added setSize number of elements to the set.

I assume you have a known total number of questions.

How about you fill an array with all available question numbers. Then traverse through the array, switching the value of each position with a random position. This will randomize the array. (See a card deck shuffling program if you don’t understand.) Then pick the top N of them. Sort those N if desired.

Since you’re already sorting the problems into numerical order, you could just eliminate duplicates as you print out the list: “If the number I’m about to print is the same as the one I printed last time through the loop, just skip it”.

There are other ways to generate your list that won’t create duplicates. How about this, create an array of all of the possible questions, then do ‘n’ random swaps inside this array, then read however many questions you want from the start of the list. (essentially “shuffle” the list and then take the top ‘n’)

Or, create said array, and randomly pull questions out of the array (using rand to create an index in to the array). As you take a question, over-write that array entry with -1 and whenever you try to pull an entry that has -1 in it, re-roll another random index. (this probably isn’t a great approach…)

If you really need to take a set of data and efficiently eliminate duplicates, you can use std::set but I’d recommend doing without that until you learn more of the basics.

Without looking through your solution, I’ll make a few suggestions. This is also related to an interview question we give to SW interns.

One approach is to keep track of the numbers you’ve already used by searching and rejecting duplicates. This is actually a pretty bad approach as the execution time of your solution is now largely dependent on the behavior of rand(). In fact, with a sufficiently bad rand() or a large enough set of questions, your program may never finish.

A different approach is to build an index array from 1 to N where N is the number of problems to choose from and you want to randomly select M Problems from it. The first random number you generate is in the range of 1 to N inclusive. You remove that element from the index array and add it to your homework array. The index array is now N-1 in size, so you generate a random number from 1 to N-1, extract, and add it to homework array, etc. You continue to shrink the range of the random number and the index array at the same rate until you have extracted as many indices as you like – M of them. The index array only ever contains unclaimed problems, the execution time is constant, no duplicates, etc.

Another approach is to make the index array and perform an inplace perfect shuffle on it. You may want to google perfect shuffle algorithm for exposure to some of them. Then take the first M. Note that this is pretty closely related to the previous approach.

Greg McKaskle

Another possible technique, and probably the fastest executing, would be to generate an array of all possible questions (say 1-100), randomize it, and select the top however many items (say, 10). This would give you no duplicates and would not have time complexity issues. A logical way to think of this is randomly generating a number, where the range of numbers possibly generated is only those which have not yet been generated.

This is how a trivia gameon our website works.

*Simple (and very fast) TP7 implementation:

{ Choose 20 numbers from the set [1..10000] with no duplicates }

CONST
size=10000;
choose=20;

VAR
a: array[1..size] of word;

Procedure Initialize;
var i: word;
begin
for i:=1 to size do a*:=i;
end;

Procedure Shuffle;
var i,j,temp: word;
begin
for i:=1 to choose do
  begin
  j:=random(size)+1;
  temp:=a*; a*:=a[j]; a[j]:=temp;
  end;
end;

Procedure Select;
var i: word;
begin
for i:=1 to choose do writeln(a*:8);
end;

CONST
cr=#13#10;

BEGIN

write('Initializing...');     Initialize;
write(cr,'Seed the RNG...');  Randomize;
write(cr,'Shuffle...');       Shuffle;
write(cr,'Select...',cr);     Select;
write('done. press ENTER ');  readln;

END.

Add the following header:

{$APPTYPE CONSOLE}
USES Windows;

… and change “word” to “dword” (4 places), and you’ve got a 32-bit Delphi console app which will select 100 non-duplicate random values from a pool of one million in less than 0.004 seconds on an 8-year-old desktop PC.

I’m afraid I don’t understand what you’re getting at. I can follow your code, but I don’t recognize the language, nor do I see the purpose of your header

@Zholl:

The algorithm I posted works very fast for very large sample spaces. It does not require the overhead of removing samples from the array, nor does it require the overhead of shuffling the entire sample space.

The header is to allow 16-bit TP7 code to be compiled by Delphi into a 32-bit app that can run natively under Windows. You can code the algorithm in your language of choice.

I’ll note that I’ve used Greg McKaskle’s last solutions in several simulation and test applications in Java by using something similar to this:


LinkedList<Object> someList = getList();
Collections.rotate(someList, (int)(Math.random()*(someList.size() - 1)));
Object randomObject = someList.remove();

]Rotating is usually much faster than shuffling]Removing the element from the queue ensures lack of duplicates
If you want a good lesson in C, you could try implementing a basic generic rotate function and a basic generic LinkedList object. This isn’t the most efficient way in Java since the *Collections *class doesn’t know about the *Queue *implementation of the *LinkedList *(thus it could save time by manipulating indices rather than doing a rotation), but implementing your own solution would give you that flexibility.

What is TP7? Your code looks like standard Pascal (with unusual style choices for capitalization).

You may use an obscure programming language when it isn’t mentioned for the first 6 pages of google results…

*I’ll just call it pseudo-code from now on :slight_smile:

I just re-ran it with a larger sample space.

On an 8-year-old desktop PC running XP Pro, it took a little over 1 millisecond to select 10 thousand non-dup random samples from a 100 million sample space (neglecting the time required to populate the sample space).

A note about random number generation. My understanding…

Back in the day, over 15 or 20 years ago most ‘random’ numbers were software algorithms that were not that random. If you knew the algorithm, seeding and such you could predict the next ‘random’ number.

People that really needed random numbers built external circuits that essentially measured the kTb voltage across a resistor. Basically naturally occurring resistor noise.

Then about 15 years ago Intel basically embedded this circuit into the Pentium processor. Starting with a particular revision processor, you could read a register in the processor that would give you a random number, that was based on thermal noise, not a mathematical pseudorandom generator.

So now I’m curious. Does anyone know how to get to this function from Window or linux ?

edit: more on the topic is here http://http://en.wikipedia.org/wiki/Hardware_random_number_generator