Conversation
Pull request overview
Refactors the profiles scraping/parsing pipeline to use the UTD Profiles API (instead of browser scraping), adds profile loading from the new raw JSON format, and improves professor record merging/link consistency during parsing.
Changes:
- Replaced Chromedp-based profile scraping with HTTP calls to the Profiles API, batching detail requests and writing a raw profiles.json.
- Added profile JSON decoding and loading logic in the parser, plus helpers to pick the "best" fields (location/URI/image/titles) when building professors.
- Added a post-parse synchronization step to keep professor↔section references consistent, along with new unit tests for key helpers.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scrapers/profiles.go | Implements API-based profile scrape, batching, dedupe, and pretty JSON output. |
| scrapers/profiles_test.go | Adds tests for request URL building and profile dedupe behavior. |
| parser/profiles.go | Introduces profile-row decoding and professor↔section link syncing utilities. |
| parser/profiles_test.go | Adds tests for selecting best information/URIs/images and building professors from API rows. |
| parser/profileLoader.go | Adds LoadProfiles for the new raw profile JSON shape and a ParseProfiles entrypoint. |
| parser/professorParser.go | Adds profile-row → professor builders/merging and email-based professor matching during section parsing. |
| parser/parser.go | Calls professor↔section link sync after parsing sections. |
```go
// syncProfessorSectionLinks ensures professor->section links match parsed section references.
// Course connections are derived transitively through each section's Course_reference.
func syncProfessorSectionLinks() {
	if len(Professors) == 0 || len(Sections) == 0 {
		return
	}

	for _, prof := range Professors {
		if prof == nil {
			continue
		}
		if prof.Sections == nil {
			prof.Sections = []primitive.ObjectID{}
		}
	}

	for sectionID, section := range Sections {
		if section == nil {
			continue
		}

		for _, profID := range section.Professors {
			profKey, ok := ProfessorIDMap[profID]
			if !ok {
				continue
			}
			prof, exists := Professors[profKey]
			if !exists || prof == nil {
				continue
			}
			if !containsObjectID(prof.Sections, sectionID) {
				prof.Sections = append(prof.Sections, sectionID)
			}
		}
	}

	for _, prof := range Professors {
		if prof == nil {
			continue
		}
		prof.Sections = dedupeObjectIDs(prof.Sections)
	}
}
```
syncProfessorSectionLinks claims to ensure professor->section links “match” section references, but the implementation only adds missing section IDs and dedupes; it never removes stale IDs already present on prof.Sections. This can leave incorrect links if prof.Sections was pre-populated or merged. Consider rebuilding each professor’s Sections slice from Sections[...].Professors (or explicitly filtering to only sectionIDs that reference that professor), or rename/reword the function/comment to reflect additive behavior only.
```go
	payload, err := io.ReadAll(fptr)
	if err != nil {
		log.Printf("Failed to read profiles JSON: %v", err)
		return false
	}

	rows, err := decodeProfileRows(payload)
	if err != nil {
		log.Printf("Failed to decode profiles JSON: %v", err)
		return false
	}
```
LoadProfiles reads the entire profiles.json into memory (io.ReadAll) and then attempts two full json.Unmarshal passes. This is a potential memory/CPU regression versus the existing streaming decoder approach in loadProfiles, especially since the new scraped payload can be large (information/areas/media). Consider decoding with json.Decoder (streaming) and/or decoding once into a lightweight envelope (e.g., json.RawMessage for profile/count) to avoid holding and parsing the full payload multiple times.
```go
		if profByEmail, emailMatch := findProfessorByEmail(email); emailMatch {
			profByEmail.Sections = append(profByEmail.Sections, sectionId)
			if profByEmail.Email == "" {
				profByEmail.Email = email
			}
			if _, exists := ProfessorIDMap[profByEmail.Id]; !exists {
				ProfessorIDMap[profByEmail.Id] = profKey
			}
			Professors[profKey] = profByEmail
			profRefs = append(profRefs, profByEmail.Id)
			continue
		}

		prof = newProfessor(firstName, lastName)
		prof.Titles = []string{utils.TrimWhitespace(match[2])}
		prof.Email = utils.TrimWhitespace(match[3])
		prof.Email = email
		prof.Sections = []primitive.ObjectID{sectionId}
		profRefs = append(profRefs, prof.Id)
		Professors[profKey] = prof
		ProfessorIDMap[prof.Id] = profKey
	}
	return profRefs
}

func findProfessorByEmail(email string) (*schema.Professor, bool) {
	normalized := strings.TrimSpace(strings.ToLower(email))
	if normalized == "" {
		return nil, false
	}

	for _, prof := range Professors {
		if prof == nil {
			continue
		}
		if strings.TrimSpace(strings.ToLower(prof.Email)) == normalized {
			return prof, true
		}
	}

	return nil, false
```
findProfessorByEmail performs a full scan over the Professors map for every unmatched instructor name. During section parsing this can become O(P*I) where P is professor count and I is instructor matches, which may be noticeable on full-term datasets. Consider maintaining an email->*Professor index map (normalized lowercase) that is updated when professors are created/merged, so lookups are O(1).
syncProfessorSectionLinks mutates global state (Professors/Sections/ProfessorIDMap) and is now invoked by Parse, but there are no tests covering its behavior (e.g., adding missing links, deduping, and handling missing/mismatched ProfessorIDMap entries). Adding a focused unit test would help prevent regressions since this impacts the professors.json output.
No description provided.