mike fettis

@michael.j.fettis

Python: list de-duping, list of lists batches

Lists, lists, lists. I love lists; just about everything can be a list of something, and there are a thousand ways to solve problems with them. Let’s talk about de-duping a long list and then cutting that list up into batches.
The problem encountered was a list of some 60,000 IDs, and in that list there were 3 dupes. Once the list was clean, it needed to be batched up with 5,000 IDs per batch.

The initial idea was to build iterating loops to cycle through the list and find the dupes, using a loop nested inside a loop. This was going to be tedious and perhaps 15 lines of code, pretty standard stuff. So, to the Google, which of course ended up on Stack Overflow. Python has a built-in type called “set”, an unordered collection that holds each value only once.

Use some Python built-in functionality and the problem solves itself: convert the list to a set and the duplicates drop out. The one catch is that a set does not keep the original order of the list.
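
A minimal sketch of the set approach, using a short made-up list (the name `ids` and its values are placeholders, not the real data). It also shows `dict.fromkeys`, an alternative worth considering if the original order has to survive, since dict keys are unique and keep insertion order on Python 3.7+.

```python
# Hypothetical sample data: a short list of IDs with a couple of duplicates.
ids = [101, 102, 103, 102, 104, 101, 105]

# A set keeps only unique values, but makes no promise about order.
unique_ids = list(set(ids))

# Alternative when order matters: dict keys are unique and keep
# first-seen order (Python 3.7+).
unique_ids_ordered = list(dict.fromkeys(ids))

print(len(ids), len(unique_ids))  # 7 5
print(unique_ids_ordered)         # [101, 102, 103, 104, 105]
```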

Next up, time to create batched lists from a single list. Again, this is something pretty simple, but it can be made even easier using a single yield and the len() built-in. What’s yield?

Yield is a keyword that is used like return, except the function will return a generator.
What is a generator?
Generators are iterators, but you can only iterate over them once. That’s because they do not store all the values in memory; they generate the values on the fly.
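
To make the "only once" part concrete, here is a small sketch with a made-up generator function (`count_up_to` is an illustrative name, not something from the original problem). The second pass over the same generator comes back empty because the values were never stored.

```python
def count_up_to(limit):
    """Yield 1, 2, ..., limit one value at a time."""
    n = 1
    while n <= limit:
        yield n
        n += 1

numbers = count_up_to(3)     # creates the generator; no code has run yet
first_pass = list(numbers)   # consumes every value
second_pass = list(numbers)  # the generator is exhausted by now

print(first_pass)   # [1, 2, 3]
print(second_pass)  # []
```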

The way that we will use this is to create a separate function (a def) for the chunking, then pass it the list and the length of the batch. The function will then run its for/yield loop until it runs out of data. WHAT? Here’s how it plays out.
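
The chunking def described above can be sketched roughly like this. The function and parameter names (`chunks`, `items`, `batch_size`) are assumptions for illustration; the idea is just range() stepping through the list by the batch length, with a slice yielded at each step.

```python
def chunks(items, batch_size):
    """Yield successive slices of items, each at most batch_size long."""
    # range() steps through the start index of every batch: 0, batch_size, ...
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe, so the last batch may be shorter.
        yield items[start:start + batch_size]

for batch in chunks([1, 2, 3, 4, 5, 6, 7], 3):
    print(batch)
# [1, 2, 3]
# [4, 5, 6]
# [7]
```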

The first time the for loop calls the generator object created from your function, it will run the code in your function from the beginning until it hits yield, then it’ll return the first value of the loop. Each subsequent call will run the loop you have written in the function one more time and return the next value, until there is no value left to return.
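
One way to watch this happen is to drive a generator by hand with next(), which is what a for loop does behind the scenes. The sketch below uses a throwaway function (`batches_of_two` is a made-up name) with print calls added so you can see when each piece of the body actually runs.

```python
def batches_of_two(items):
    print("starting")
    for start in range(0, len(items), 2):
        print("about to yield from index", start)
        yield items[start:start + 2]
    print("no more data")

gen = batches_of_two(["a", "b", "c"])  # nothing is printed yet

first = next(gen)   # runs up to the first yield
second = next(gen)  # resumes after the yield and loops once more

try:
    next(gen)       # the loop ends, so the generator raises StopIteration
    finished = False
except StopIteration:
    finished = True  # a for loop catches this for you and simply stops

print(first, second, finished)  # ['a', 'b'] ['c'] True
```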

What this ends up doing is producing all the values of the original list broken up into batches; wrap the generator in list() and you get a list of lists. There you have it, quick and dirty and much less painful than writing the iterating loops the hard way. Let Python do Python things and we can leverage the power of the language to chew up data.
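
Putting both steps together on numbers shaped like the original problem might look like the sketch below. The IDs are synthetic stand-ins (exactly 60,000 entries with 3 repeats is an assumption for the sake of the arithmetic), and `chunks` is the same hypothetical helper name as before.

```python
def chunks(items, batch_size):
    """Yield successive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Synthetic stand-in data: 59,997 unique IDs plus 3 repeats = 60,000 entries.
raw_ids = list(range(59_997)) + [10, 20, 30]

clean_ids = list(set(raw_ids))           # de-dupe
batches = list(chunks(clean_ids, 5000))  # list() turns the generator into a list of lists

print(len(raw_ids))      # 60000
print(len(clean_ids))    # 59997
print(len(batches))      # 12
print(len(batches[-1]))  # 4997
```

Eleven full batches cover 55,000 IDs, so the twelfth picks up the remaining 4,997.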

Lists are a simple concept, but the simplest concepts can be the most powerful. It is always important to understand the hard way of doing things and then, once that is accomplished, learn the quickest and easiest way possible. When the shortcut breaks, or doesn’t work for an implementation, it is important to know how it works so it can be fixed or reworked another way.
