
Refactor profiles scraper api #111 (#168)

Open

AJayaV06 wants to merge 3 commits into develop from refactor-profiles-scraper-api-111

Conversation

@AJayaV06

No description provided.

Copilot AI review requested due to automatic review settings April 19, 2026 21:15

Copilot AI left a comment


Pull request overview

Refactors the profiles scraping/parsing pipeline to use the UTD Profiles API (instead of browser scraping), adds profile loading from the new raw JSON format, and improves professor record merging/link consistency during parsing.

Changes:

  • Replaced Chromedp-based profile scraping with HTTP calls to the Profiles API, batching detail requests and writing a raw profiles.json.
  • Added profile JSON decoding + loading logic in parser, plus helpers to pick “best” fields (location/URI/image/titles) when building professors.
  • Added a post-parse synchronization step to keep professor↔section references consistent, along with new unit tests for key helpers.
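The batching of detail requests mentioned above lives in scrapers/profiles.go and is not quoted in this review. As a rough illustration of the idea of splitting profile IDs into fixed-size batches, each of which would become one HTTP detail request (the helper name `batchIDs` and the batch size are hypothetical, not the PR's actual code):

```go
package main

import "fmt"

// batchIDs splits profile IDs into fixed-size batches. In the PR, each
// batch would back one detail request to the Profiles API; the endpoint
// and request shape are defined in scrapers/profiles.go and not shown here.
func batchIDs(ids []string, size int) [][]string {
	var batches [][]string
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		batches = append(batches, ids[start:end])
	}
	return batches
}

func main() {
	ids := []string{"a", "b", "c", "d", "e"}
	for _, batch := range batchIDs(ids, 2) {
		fmt.Println(batch) // each batch would become one detail request
	}
}
```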

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

scrapers/profiles.go: Implements the API-based profile scrape, batching, dedupe, and pretty-printed JSON output.
scrapers/profiles_test.go: Adds tests for request URL building and profile dedupe behavior.
parser/profiles.go: Introduces profile-row decoding and professor↔section link-syncing utilities.
parser/profiles_test.go: Adds tests for selecting the best information/URIs/images and building professors from API rows.
parser/profileLoader.go: Adds LoadProfiles for the new raw profile JSON shape and a ParseProfiles entrypoint.
parser/professorParser.go: Adds profile-row → professor builders/merging and email-based professor matching during section parsing.
parser/parser.go: Calls the professor↔section link sync after parsing sections.


Comment thread parser/profiles.go
Comment on lines +78 to +120
// syncProfessorSectionLinks ensures professor->section links match parsed section references.
// Course connections are derived transitively through each section's Course_reference.
func syncProfessorSectionLinks() {
	if len(Professors) == 0 || len(Sections) == 0 {
		return
	}

	for _, prof := range Professors {
		if prof == nil {
			continue
		}
		if prof.Sections == nil {
			prof.Sections = []primitive.ObjectID{}
		}
	}

	for sectionID, section := range Sections {
		if section == nil {
			continue
		}

		for _, profID := range section.Professors {
			profKey, ok := ProfessorIDMap[profID]
			if !ok {
				continue
			}
			prof, exists := Professors[profKey]
			if !exists || prof == nil {
				continue
			}
			if !containsObjectID(prof.Sections, sectionID) {
				prof.Sections = append(prof.Sections, sectionID)
			}
		}
	}

	for _, prof := range Professors {
		if prof == nil {
			continue
		}
		prof.Sections = dedupeObjectIDs(prof.Sections)
	}
}

Copilot AI Apr 19, 2026


syncProfessorSectionLinks claims to ensure professor->section links “match” section references, but the implementation only adds missing section IDs and dedupes; it never removes stale IDs already present on prof.Sections. This can leave incorrect links if prof.Sections was pre-populated or merged. Consider rebuilding each professor’s Sections slice from Sections[...].Professors (or explicitly filtering to only sectionIDs that reference that professor), or rename/reword the function/comment to reflect additive behavior only.
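One way to realize the "rebuild from section references" option is to clear every professor's Sections slice and re-derive it purely from each section's professor list, so stale links are dropped rather than merely deduped. A sketch with hypothetical stand-in types (the real code uses primitive.ObjectID keys and the package-level Professors/Sections maps):

```go
package main

import "fmt"

// ID stands in for primitive.ObjectID in this sketch.
type ID string

type Section struct{ Professors []ID }
type Professor struct{ Sections []ID }

// rebuildProfessorSections rebuilds each professor's Sections slice purely
// from section->professor references. Anything not re-derived from a section
// is treated as stale and removed, unlike the additive sync in the PR.
func rebuildProfessorSections(professors map[ID]*Professor, sections map[ID]*Section) {
	for _, prof := range professors {
		if prof != nil {
			prof.Sections = []ID{} // reset: only section-backed links survive
		}
	}
	for sectionID, section := range sections {
		if section == nil {
			continue
		}
		for _, profID := range section.Professors {
			if prof, ok := professors[profID]; ok && prof != nil {
				prof.Sections = append(prof.Sections, sectionID)
			}
		}
	}
}

func main() {
	profs := map[ID]*Professor{"p1": {Sections: []ID{"stale"}}}
	secs := map[ID]*Section{"s1": {Professors: []ID{"p1"}}}
	rebuildProfessorSections(profs, secs)
	fmt.Println(profs["p1"].Sections) // stale link dropped; only "s1" remains
}
```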

Comment thread parser/profileLoader.go
Comment on lines +23 to +33
	payload, err := io.ReadAll(fptr)
	if err != nil {
		log.Printf("Failed to read profiles JSON: %v", err)
		return false
	}

	rows, err := decodeProfileRows(payload)
	if err != nil {
		log.Printf("Failed to decode profiles JSON: %v", err)
		return false
	}

Copilot AI Apr 19, 2026


LoadProfiles reads the entire profiles.json into memory (io.ReadAll) and then attempts two full json.Unmarshal passes. This is a potential memory/CPU regression versus the existing streaming decoder approach in loadProfiles, especially since the new scraped payload can be large (information/areas/media). Consider decoding with json.Decoder (streaming) and/or decoding once into a lightweight envelope (e.g., json.RawMessage for profile/count) to avoid holding and parsing the full payload multiple times.

Comment thread parser/professorParser.go
Comment on lines +112 to +151
		if profByEmail, emailMatch := findProfessorByEmail(email); emailMatch {
			profByEmail.Sections = append(profByEmail.Sections, sectionId)
			if profByEmail.Email == "" {
				profByEmail.Email = email
			}
			if _, exists := ProfessorIDMap[profByEmail.Id]; !exists {
				ProfessorIDMap[profByEmail.Id] = profKey
			}
			Professors[profKey] = profByEmail
			profRefs = append(profRefs, profByEmail.Id)
			continue
		}

		prof = newProfessor(firstName, lastName)
		prof.Titles = []string{utils.TrimWhitespace(match[2])}
		prof.Email = email
		prof.Sections = []primitive.ObjectID{sectionId}
		profRefs = append(profRefs, prof.Id)
		Professors[profKey] = prof
		ProfessorIDMap[prof.Id] = profKey
	}
	return profRefs
}

func findProfessorByEmail(email string) (*schema.Professor, bool) {
	normalized := strings.TrimSpace(strings.ToLower(email))
	if normalized == "" {
		return nil, false
	}

	for _, prof := range Professors {
		if prof == nil {
			continue
		}
		if strings.TrimSpace(strings.ToLower(prof.Email)) == normalized {
			return prof, true
		}
	}

	return nil, false

Copilot AI Apr 19, 2026


findProfessorByEmail performs a full scan over the Professors map for every unmatched instructor name. During section parsing this can become O(P*I) where P is professor count and I is instructor matches, which may be noticeable on full-term datasets. Consider maintaining an email->*Professor index map (normalized lowercase) that is updated when professors are created/merged, so lookups are O(1).
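A sketch of the suggested email index, using the same normalization findProfessorByEmail already applies (lowercase plus trim); the `emailIndex` type and its method names are illustrative, not part of the PR:

```go
package main

import (
	"fmt"
	"strings"
)

// Professor stands in for schema.Professor in this sketch.
type Professor struct {
	Name  string
	Email string
}

// emailIndex maps a normalized email to its professor, turning the per-lookup
// full scan of the Professors map into an O(1) map access. The parser would
// need to update this index wherever professors are created or merged.
type emailIndex map[string]*Professor

func normalizeEmail(email string) string {
	return strings.TrimSpace(strings.ToLower(email))
}

func (idx emailIndex) add(p *Professor) {
	if key := normalizeEmail(p.Email); key != "" {
		idx[key] = p
	}
}

func (idx emailIndex) find(email string) (*Professor, bool) {
	p, ok := idx[normalizeEmail(email)]
	return p, ok
}

func main() {
	idx := emailIndex{}
	idx.add(&Professor{Name: "Jane Doe", Email: "Jane.Doe@utdallas.edu"})
	if p, ok := idx.find("  jane.doe@UTDALLAS.edu "); ok {
		fmt.Println(p.Name) // matched despite case and whitespace differences
	}
}
```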

Comment thread parser/profiles.go
Comment on lines +78 to +120
(same code snippet as quoted in the first thread above)

Copilot AI Apr 19, 2026


syncProfessorSectionLinks mutates global state (Professors/Sections/ProfessorIDMap) and is now invoked by Parse, but there are no tests covering its behavior (e.g., adding missing links, deduping, and handling missing/mismatched ProfessorIDMap entries). Adding a focused unit test would help prevent regressions since this impacts the professors.json output.
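The kind of focused check being asked for, sketched as a self-contained program with stand-in types rather than the package's real globals; in the repo this would instead live in a parser _test.go file exercising syncProfessorSectionLinks and resetting Professors/Sections/ProfessorIDMap around each case:

```go
package main

import "fmt"

// ID stands in for primitive.ObjectID in this sketch.
type ID string

type Section struct{ Professors []ID }
type Professor struct{ Sections []ID }

// syncLinks is a local copy of the additive behavior under test: it appends
// missing section links and silently skips professor IDs absent from the ID
// map, mirroring syncProfessorSectionLinks in parser/profiles.go.
func syncLinks(professors map[string]*Professor, idMap map[ID]string, sections map[ID]*Section) {
	for sectionID, section := range sections {
		for _, profID := range section.Professors {
			key, ok := idMap[profID]
			if !ok {
				continue // mismatched ProfessorIDMap entry: skipped, no panic
			}
			if prof, ok := professors[key]; ok && prof != nil {
				prof.Sections = append(prof.Sections, sectionID)
			}
		}
	}
}

func main() {
	profs := map[string]*Professor{"jane": {}}
	idMap := map[ID]string{"p1": "jane"} // "p2" intentionally unmapped
	secs := map[ID]*Section{"s1": {Professors: []ID{"p1", "p2"}}}
	syncLinks(profs, idMap, secs)
	fmt.Println(profs["jane"].Sections) // missing link added; unmapped ID skipped
}
```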

2 participants