Conversation
Pull request overview
Refactors the profiles scraping/parsing pipeline to use the UTD Profiles API (instead of browser scraping), adds profile loading from the new raw JSON format, and improves professor record merging/link consistency during parsing.
Changes:
- Replaced Chromedp-based profile scraping with HTTP calls to the Profiles API, batching detail requests and writing a raw profiles.json.
- Added profile JSON decoding and loading logic in the parser, plus helpers to pick the "best" fields (location/URI/image/titles) when building professors.
- Added a post-parse synchronization step to keep professor↔section references consistent, along with new unit tests for key helpers.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scrapers/profiles.go | Implements API-based profile scrape, batching, dedupe, and pretty JSON output. |
| scrapers/profiles_test.go | Adds tests for request URL building and profile dedupe behavior. |
| parser/profiles.go | Introduces profile-row decoding and professor↔section link syncing utilities. |
| parser/profiles_test.go | Adds tests for selecting best information/URIs/images and building professors from API rows. |
| parser/profileLoader.go | Adds LoadProfiles for the new raw profile JSON shape and a ParseProfiles entrypoint. |
| parser/professorParser.go | Adds profile-row → professor builders/merging and email-based professor matching during section parsing. |
| parser/parser.go | Calls professor↔section link sync after parsing sections. |
```go
// syncProfessorSectionLinks ensures professor->section links match parsed section references.
// Course connections are derived transitively through each section's Course_reference.
func syncProfessorSectionLinks() {
	if len(Professors) == 0 || len(Sections) == 0 {
		return
	}

	for _, prof := range Professors {
		if prof == nil {
			continue
		}
		if prof.Sections == nil {
			prof.Sections = []primitive.ObjectID{}
		}
	}

	for sectionID, section := range Sections {
		if section == nil {
			continue
		}

		for _, profID := range section.Professors {
			profKey, ok := ProfessorIDMap[profID]
			if !ok {
				continue
			}
			prof, exists := Professors[profKey]
			if !exists || prof == nil {
				continue
			}
			if !containsObjectID(prof.Sections, sectionID) {
				prof.Sections = append(prof.Sections, sectionID)
			}
		}
	}

	for _, prof := range Professors {
		if prof == nil {
			continue
		}
		prof.Sections = dedupeObjectIDs(prof.Sections)
	}
}
```
syncProfessorSectionLinks claims to ensure professor->section links “match” section references, but the implementation only adds missing section IDs and dedupes; it never removes stale IDs already present on prof.Sections. This can leave incorrect links if prof.Sections was pre-populated or merged. Consider rebuilding each professor’s Sections slice from Sections[...].Professors (or explicitly filtering to only sectionIDs that reference that professor), or rename/reword the function/comment to reflect additive behavior only.
```go
	payload, err := io.ReadAll(fptr)
	if err != nil {
		log.Printf("Failed to read profiles JSON: %v", err)
		return false
	}

	rows, err := decodeProfileRows(payload)
	if err != nil {
		log.Printf("Failed to decode profiles JSON: %v", err)
		return false
	}
```
LoadProfiles reads the entire profiles.json into memory (io.ReadAll) and then attempts two full json.Unmarshal passes. This is a potential memory/CPU regression versus the existing streaming decoder approach in loadProfiles, especially since the new scraped payload can be large (information/areas/media). Consider decoding with json.Decoder (streaming) and/or decoding once into a lightweight envelope (e.g., json.RawMessage for profile/count) to avoid holding and parsing the full payload multiple times.
```go
		if profByEmail, emailMatch := findProfessorByEmail(email); emailMatch {
			profByEmail.Sections = append(profByEmail.Sections, sectionId)
			if profByEmail.Email == "" {
				profByEmail.Email = email
			}
			if _, exists := ProfessorIDMap[profByEmail.Id]; !exists {
				ProfessorIDMap[profByEmail.Id] = profKey
			}
			Professors[profKey] = profByEmail
			profRefs = append(profRefs, profByEmail.Id)
			continue
		}

		prof = newProfessor(firstName, lastName)
		prof.Titles = []string{utils.TrimWhitespace(match[2])}
		prof.Email = utils.TrimWhitespace(match[3])
		prof.Email = email
		prof.Sections = []primitive.ObjectID{sectionId}
		profRefs = append(profRefs, prof.Id)
		Professors[profKey] = prof
		ProfessorIDMap[prof.Id] = profKey
	}
	return profRefs
}

func findProfessorByEmail(email string) (*schema.Professor, bool) {
	normalized := strings.TrimSpace(strings.ToLower(email))
	if normalized == "" {
		return nil, false
	}

	for _, prof := range Professors {
		if prof == nil {
			continue
		}
		if strings.TrimSpace(strings.ToLower(prof.Email)) == normalized {
			return prof, true
		}
	}

	return nil, false
```
findProfessorByEmail performs a full scan over the Professors map for every unmatched instructor name. During section parsing this can become O(P*I) where P is professor count and I is instructor matches, which may be noticeable on full-term datasets. Consider maintaining an email->*Professor index map (normalized lowercase) that is updated when professors are created/merged, so lookups are O(1).
syncProfessorSectionLinks mutates global state (Professors/Sections/ProfessorIDMap) and is now invoked by Parse, but there are no tests covering its behavior (e.g., adding missing links, deduping, and handling missing/mismatched ProfessorIDMap entries). Adding a focused unit test would help prevent regressions since this impacts the professors.json output.
No description provided.