Efficiently Handling Large Datasets in Go

Reading and processing large datasets—whether from a database or external source—requires careful handling to avoid high memory usage, slow response times, and degraded application performance. This guide outlines focused, practical strategies for working with large datasets in Go, including real-world code examples.


1. Use Pagination

Avoid fetching an entire dataset at once. Instead, paginate using offsets or cursors. This is efficient for user-facing applications (e.g., tables or feeds).

Example (PostgreSQL with offset pagination):

func fetchPaginatedRows(db *sql.DB, limit, offset int) ([]MyRecord, error) {
	rows, err := db.Query("SELECT id, name FROM records ORDER BY id LIMIT $1 OFFSET $2", limit, offset)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var results []MyRecord
	for rows.Next() {
		var r MyRecord
		if err := rows.Scan(&r.ID, &r.Name); err != nil {
			return nil, err
		}
		results = append(results, r)
	}
	// Surface any error hit while iterating the rows.
	return results, rows.Err()
}
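
For deep pages, OFFSET gets slower because the database still scans and discards every skipped row. Cursor (keyset) pagination avoids that by remembering the last id seen. Here is a minimal sketch, assuming an integer, monotonically increasing id column (the int64 cursor type and the fetchAfterCursor name are assumptions; adjust them to your schema):

func fetchAfterCursor(db *sql.DB, cursor int64, limit int) ([]MyRecord, int64, error) {
	rows, err := db.Query(
		"SELECT id, name FROM records WHERE id > $1 ORDER BY id LIMIT $2",
		cursor, limit,
	)
	if err != nil {
		return nil, cursor, err
	}
	defer rows.Close()

	var results []MyRecord
	next := cursor
	for rows.Next() {
		var r MyRecord
		if err := rows.Scan(&r.ID, &r.Name); err != nil {
			return nil, cursor, err
		}
		results = append(results, r)
		next = int64(r.ID) // pass this back as the cursor for the next page
	}
	return results, next, rows.Err()
}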

2. Stream Data Using Rows Cursor

Go’s database/sql package streams results as you iterate with rows.Next(), which is ideal for processing large datasets row by row without loading the entire result set into memory.

Example:

func streamLargeDataset(db *sql.DB) error {
	rows, err := db.Query("SELECT id, data FROM large_table")
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var id int
		var data string
		if err := rows.Scan(&id, &data); err != nil {
			return err
		}
		processData(id, data)
	}
	return rows.Err()
}

3. Use Batch Processing

If row-by-row streaming isn’t enough (e.g., you need to process records in groups), fetch rows in batches; a variant that hands each batch to a group processor is sketched after the example below.

Example with manual batching:

func fetchInBatches(db *sql.DB, batchSize int) error {
	offset := 0
	for {
		rows, err := db.Query("SELECT id, value FROM items ORDER BY id LIMIT $1 OFFSET $2", batchSize, offset)
		if err != nil {
			return err
		}

		count := 0
		for rows.Next() {
			var id int
			var val string
			if err := rows.Scan(&id, &val); err != nil {
				rows.Close()
				return err
			}
			process(id, val)
			count++
		}
		// Surface any iteration error before moving on to the next batch.
		if err := rows.Err(); err != nil {
			rows.Close()
			return err
		}
		rows.Close()

		if count < batchSize {
			break
		}
		offset += batchSize
	}
	return nil
}
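
The loop above still handles rows one at a time inside each batch. If the goal is to act on a whole group at once (for example, a bulk insert into another store or a batched API call), collect the rows into a slice first. A minimal sketch, where Item and processBatch are illustrative names (processBatch is a hypothetical function that handles one group):

type Item struct {
	ID    int
	Value string
}

func fetchAndProcessBatch(db *sql.DB, batchSize, offset int) (int, error) {
	rows, err := db.Query(
		"SELECT id, value FROM items ORDER BY id LIMIT $1 OFFSET $2",
		batchSize, offset,
	)
	if err != nil {
		return 0, err
	}
	defer rows.Close()

	batch := make([]Item, 0, batchSize)
	for rows.Next() {
		var it Item
		if err := rows.Scan(&it.ID, &it.Value); err != nil {
			return 0, err
		}
		batch = append(batch, it)
	}
	if err := rows.Err(); err != nil {
		return 0, err
	}

	// Hand the whole group to the processor in one call.
	if len(batch) > 0 {
		processBatch(batch) // hypothetical: e.g., a bulk insert or batched API call
	}
	return len(batch), nil
}

The caller advances the offset by the returned count and stops once it is smaller than batchSize, exactly as in the loop above.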

4. Compress or Optimize Network Payloads

When fetching over the network (e.g., REST API), use gzip encoding to reduce transfer size.

Go HTTP client with gzip:

client := &http.Client{}
req, err := http.NewRequest("GET", "https://api.example.com/large-data", nil)
if err != nil {
	log.Fatal(err)
}
// Setting Accept-Encoding manually disables Go's transparent gzip handling,
// so the response body must be decompressed explicitly below.
req.Header.Set("Accept-Encoding", "gzip")

resp, err := client.Do(req)
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

var reader io.Reader = resp.Body
if resp.Header.Get("Content-Encoding") == "gzip" {
	gz, err := gzip.NewReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()
	reader = gz
}
if err := processStream(reader); err != nil {
	log.Fatal(err)
}
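
The snippet hands the (possibly decompressed) body to processStream. One possible implementation, sketched here as an assumption rather than part of the original post: if the endpoint returns a JSON array of objects, encoding/json's Decoder can consume it element by element, so the payload never has to be buffered in full.

// processStream decodes a large JSON array one element at a time.
// The record shape below is illustrative; processData is the same
// handler used in the streaming example above.
func processStream(r io.Reader) error {
	dec := json.NewDecoder(r)

	// Read the opening '[' token of the array.
	if _, err := dec.Token(); err != nil {
		return err
	}
	for dec.More() {
		var rec struct {
			ID   int    `json:"id"`
			Data string `json:"data"`
		}
		if err := dec.Decode(&rec); err != nil {
			return err
		}
		processData(rec.ID, rec.Data)
	}
	// Read the closing ']' token.
	_, err := dec.Token()
	return err
}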

5. Cache Results if Applicable

If the dataset doesn’t change frequently, consider caching.

Example using an in-memory cache (sync.Map here; third-party options such as groupcache also work):

var cache sync.Map

// getCachedData returns the value for id. The second return value reports
// whether it came from the cache (true) or had to be fetched and stored on
// this call (false).
func getCachedData(id string) (string, bool) {
	if val, ok := cache.Load(id); ok {
		return val.(string), true
	}
	// Cache miss: fetch from the database and store for subsequent calls.
	data := fetchFromDB(id)
	cache.Store(id, data)
	return data, false
}
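
If cached entries can go stale, store a timestamp alongside each value and treat old entries as misses. A minimal sketch, assuming a fixed time-to-live (cachedEntry, ttlCache, and cacheTTL are illustrative names; the time package is required):

const cacheTTL = 5 * time.Minute // illustrative time-to-live

type cachedEntry struct {
	value    string
	storedAt time.Time
}

var ttlCache sync.Map

func getCachedDataWithTTL(id string) (string, bool) {
	if v, ok := ttlCache.Load(id); ok {
		entry := v.(cachedEntry)
		if time.Since(entry.storedAt) < cacheTTL {
			return entry.value, true
		}
		// Entry is stale: fall through and refresh it.
	}
	data := fetchFromDB(id)
	ttlCache.Store(id, cachedEntry{value: data, storedAt: time.Now()})
	return data, false
}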

6. Use Asynchronous Processing

For massive workloads, decouple fetching from processing using goroutines or job queues (e.g., using channels or background workers).

Example with worker pool:

func startWorkerPool(jobs <-chan int, results chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()
	for id := range jobs {
		data := fetchAndProcess(id)
		results <- data
	}
}
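
Wiring the pool together means launching a fixed number of workers, feeding the jobs channel, closing it once all IDs are queued, and closing results after every worker finishes. A minimal sketch, assuming the startWorkerPool worker above (processIDs and numWorkers are illustrative names):

func processIDs(ids []int, numWorkers int) []string {
	jobs := make(chan int)
	results := make(chan string)

	var wg sync.WaitGroup
	wg.Add(numWorkers)
	for i := 0; i < numWorkers; i++ {
		go startWorkerPool(jobs, results, &wg)
	}

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Feed jobs in a separate goroutine so collecting results below
	// doesn't deadlock on the unbuffered channels.
	go func() {
		for _, id := range ids {
			jobs <- id
		}
		close(jobs)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}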

7. Monitor and Profile Performance

Use Go's built-in tooling (pprof, expvar) together with database-side monitoring (e.g., pg_stat_statements) to track performance.

Basic memory profiling setup:

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

// Somewhere during startup:
go func() {
	log.Println(http.ListenAndServe("localhost:6060", nil))
}()

Then visit http://localhost:6060/debug/pprof/ for live profiling data, or pull a specific profile with go tool pprof, for example: go tool pprof http://localhost:6060/debug/pprof/heap.


Summary

Strategy            Benefits
Pagination          Reduces load and memory usage
Streaming           Processes rows one at a time, low memory footprint
Batching            Allows controlled, grouped processing
Compression         Reduces data transfer size
Caching             Reduces redundant fetches
Async Processing    Improves responsiveness
Monitoring          Identifies bottlenecks early

By combining Go’s built-in features (like streaming, goroutines, channels) with smart database practices (pagination, indexing, batching), you can reliably and efficiently handle large datasets in scalable applications.
