The other day I was interviewing at a company and was asked the following question: how can you count the occurrences of a word in a 50GB file with 4GB of RAM? The trick is not to load the whole file into memory, but to process each word as the file pointer advances. This way the whole file can be processed with a minimal amount of memory.

The follow-up question was: how can we speed this process up using multithreading? The solution is to keep multiple pointers at different parts of the file and have each thread read its chunk of the file concurrently. Finally, the results are combined.

The file is divided into equal chunks, each with its own file pointer. Let's say the file is 1GB in size: each of 5 threads will process 200MB. Each pointer starts reading at the byte right after the last byte read by the previous pointer.

Implementation

When it comes to multithreading, the easier option that comes to mind is goroutines. I will walk you through a program that reads a large text file and builds a dictionary of words. The program is meant to read a 1GB file using 5 goroutines, each reading 200MB. Note that the listing as shown is configured for 2 goroutines of 500MB each; for the five-way split, set limit to 200 * mb and the loop bound to 5.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
	"sync"
)

const mb = 1024 * 1024
const gb = 1024 * mb

func main() {
	// A WaitGroup to wait for all goroutines to finish.
	wg := sync.WaitGroup{}

	// This channel carries every word read by the reader goroutines.
	channel := make(chan string)

	// A dictionary which stores the count of each unique word.
	dict := make(map[string]int64)

	// done signals the main goroutine that all the words have been
	// entered in the dictionary.
	done := make(chan bool, 1)

	// Read all incoming words from the channel and add them to the dictionary.
	go func() {
		for s := range channel {
			dict[s]++
		}

		// Signal the main goroutine that all the words have entered the dictionary.
		done <- true
	}()

	// current is the byte offset at which the next chunk starts.
	var current int64

	// limit is the chunk size of the file processed by each goroutine.
	var limit int64 = 500 * mb

	for i := 0; i < 2; i++ {
		wg.Add(1)

		// Pass i and current as arguments so that each goroutine works on its
		// own copy instead of racing with the loop that updates them.
		go func(id int, offset int64) {
			defer wg.Done()
			read(offset, limit, "gameofthrones.txt", channel)
			fmt.Printf("goroutine %d has completed\n", id)
		}(i, current)

		// The next chunk starts where this one ends.
		current += limit
	}

	// Wait for all goroutines to complete.
	wg.Wait()
	close(channel)

	// Wait for the dictionary to process all the words.
	<-done
	close(done)
}

// read sends every word that starts inside [offset, offset+limit) to channel.
func read(offset int64, limit int64, fileName string, channel chan<- string) {
	file, err := os.Open(fileName)
	if err != nil {
		panic(err)
	}
	defer file.Close()

	// pos is the absolute position of the next unread byte. For every chunk
	// but the first, start one byte early so that a word beginning exactly at
	// offset is not skipped.
	pos := offset
	if offset != 0 {
		pos = offset - 1
	}

	// Move the pointer of the file to the start of the designated chunk.
	if _, err := file.Seek(pos, io.SeekStart); err != nil {
		panic(err)
	}
	reader := bufio.NewReader(file)

	// This block ensures that the chunk starts at a new word. If the start
	// position falls inside a word, skip ahead to the end of that word; it
	// belongs to the previous chunk.
	if offset != 0 {
		b, err := reader.ReadBytes(' ')
		if err == io.EOF {
			fmt.Println("EOF")
			return
		}
		if err != nil {
			panic(err)
		}
		pos += int64(len(b))
	}

	// Stop once the next word would start beyond the end of the chunk.
	for pos < offset+limit {
		b, err := reader.ReadBytes(' ')
		if err != nil && err != io.EOF {
			panic(err)
		}
		pos += int64(len(b))

		// strings.Fields also splits on newlines and tabs inside the token.
		for _, s := range strings.Fields(string(b)) {
			// Send the word over the channel to be entered into the dictionary.
			channel <- s
		}

		// At end of file the last word has already been sent, so stop.
		if err == io.EOF {
			break
		}
	}
}
```

Here we also have to handle an edge case. What if the start of a chunk is not the start of a new word? Similarly, what if the end of a chunk is not the end of a word? We handle this by extending the end of a chunk to the end of its last word and by moving the start of the next chunk to the start of the next word.

Channels unify all the words read by the various goroutines into a single dictionary.
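The chunk-boundary rule described above can be tried out on a short in-memory string before pointing it at a large file. The following is only a sketch: `wordsInChunk` and `splitAll` are hypothetical helper names that do not appear in the original program, the "file" is a `strings.Reader`, and words are assumed to be separated by spaces. A word belongs to the chunk in which its first byte falls.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// wordsInChunk returns the words whose first byte lies in
// [offset, offset+limit). It is a hypothetical in-memory stand-in for a
// file-based chunk reader.
func wordsInChunk(data string, offset, limit int64) []string {
	src := strings.NewReader(data)
	pos := offset
	if offset != 0 {
		pos = offset - 1 // step back so a word starting exactly at offset is kept
	}
	if _, err := src.Seek(pos, io.SeekStart); err != nil {
		panic(err)
	}
	reader := bufio.NewReader(src)

	if offset != 0 {
		// Skip the tail of a word that belongs to the previous chunk.
		b, err := reader.ReadBytes(' ')
		if err != nil {
			return nil // end of data: no word starts in this chunk
		}
		pos += int64(len(b))
	}

	var words []string
	for pos < offset+limit {
		b, err := reader.ReadBytes(' ')
		pos += int64(len(b))
		words = append(words, strings.Fields(string(b))...)
		if err != nil {
			break // end of data: the last word has already been kept
		}
	}
	return words
}

// splitAll walks the data chunk by chunk and concatenates the results.
func splitAll(data string, limit int64) []string {
	var all []string
	for off := int64(0); off < int64(len(data)); off += limit {
		all = append(all, wordsInChunk(data, off, limit)...)
	}
	return all
}

func main() {
	text := "the quick brown fox jumps over the lazy dog"
	for off := int64(0); off < int64(len(text)); off += 10 {
		fmt.Printf("chunk at %2d: %q\n", off, wordsInChunk(text, off, 10))
	}
	fmt.Println(len(splitAll(text, 10)), "words in total")
}
```

With 10-byte chunks, the boundary at byte 30 falls inside "over". That word is reported by the chunk starting at 20, and the chunk starting at 30 skips its tail and begins with "the", so every word is reported exactly once.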
A sync.WaitGroup synchronizes the goroutines and ensures that all of them have finished reading the file.

Results

Processing the 1GB file concurrently took half the time of processing it serially, so throughput doubled. The reason we did not get a 5x speedup, i.e. one proportional to the number of goroutines, is that although goroutines are lightweight threads, reading and processing a file keeps one whole OS-level CPU core busy; it has no sleep time. Hence, on a dual-core system it effectively only doubles the performance of file processing.
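For reference, the serial approach that the concurrent version is compared against can be sketched as a single streaming pass. This is a minimal illustration, not the code that was benchmarked: `countWords` and `countWordsIn` are hypothetical names, and words are assumed to be separated by whitespace. Because `bufio.Scanner` hands over one word at a time, memory use grows with the number of distinct words, not with the size of the input.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countWords streams r one word at a time and tallies each word.
func countWords(r io.Reader) (map[string]int64, error) {
	counts := make(map[string]int64)
	scanner := bufio.NewScanner(r)
	scanner.Split(bufio.ScanWords) // split on any whitespace
	for scanner.Scan() {
		counts[scanner.Text()]++
	}
	return counts, scanner.Err()
}

// countWordsIn is a convenience wrapper for in-memory text.
func countWordsIn(text string) map[string]int64 {
	counts, err := countWords(strings.NewReader(text))
	if err != nil {
		panic(err)
	}
	return counts
}

func main() {
	// A small in-memory sample; for a real file, pass the *os.File
	// returned by os.Open to countWords instead.
	counts := countWordsIn("winter is coming\nwinter is here")
	fmt.Println(counts["winter"], counts["is"], counts["coming"])
}
```

Timing this function against the concurrent program on the same file gives the serial baseline for the comparison above.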


Leveraging Multithreading To Read Large Files Faster In Go
