A document analysis tool that extracts text from PDF, DOCX, and TXT files, classifies each document by topic, and scores its relevance through keyword matching.
- Multi-format Document Support: Extract text from PDF, DOCX, and TXT files
- Intelligent Topic Classification: AI-powered topic classification using Sentence Transformers
- Keyword Matching Analysis: Calculate topic relevance based on predefined keyword sets
- Interactive Web Interface: User-friendly Streamlit-based web application
- Real-time Analysis: Get instant results with visual progress indicators
- Multiple Analysis Pages: Main analysis page and advanced features
- Python 3.8 or higher
- pip package manager
- Clone the repository
    git clone https://github.com/NhanPhamThanh-IT/Scan-PDF-Paper.git
    cd Scan-PDF-Paper
- Install dependencies
    pip install -r requirements.txt
- Run the application
    streamlit run app/main.py
- Open your browser and navigate to
    http://localhost:8501
- Select a Topic: Choose from predefined topics including:
  - AI & Technology
  - Healthcare
  - Finance
  - Environment
  - Cybersecurity
  - Software Development
  - And more...
- Upload Document: Support for multiple file formats:
  - PDF files
  - Microsoft Word documents (.docx)
  - Plain text files (.txt)
- Get Analysis Results: View detailed analysis including:
  - Total word count
  - Keyword matches found
  - Topic relevance percentage
  - Detailed breakdown of analysis
Access the Advanced page for additional functionality and enhanced analysis options.
Scan-PDF-Paper/
├── app/
│   ├── main.py                     # Main Streamlit application
│   ├── assets/
│   │   └── themes.css              # CSS styling
│   ├── core/
│   │   ├── AI/
│   │   │   └── TopicClassifier.py  # AI-powered topic classification
│   │   └── Utils/
│   │       ├── DataHandling.py     # Data processing utilities
│   │       ├── FileHandling.py     # File extraction utilities
│   │       └── TextHandling.py     # Text processing utilities
│   ├── dataset/
│   │   ├── metadata/
│   │   │   └── topics.json         # Available topics list
│   │   └── topics_keywords/        # Keyword datasets for each topic
│   │       ├── AI.json
│   │       ├── Healthcare.json
│   │       ├── Finance.json
│   │       └── ...
│   ├── pages/
│   │   ├── MainPage.py             # Main analysis interface
│   │   ├── AdvancesPage.py         # Advanced features
│   │   └── HelpsPage.py            # Help and documentation
│   ├── settings/
│   │   └── ThemeManager.py         # Theme management
│   └── ui/
│       ├── PageHeaderComponent.py  # Reusable header component
│       ├── ResultComponent.py      # Results display component
│       └── TabsComponent.py        # Navigation tabs component
├── requirements.txt                # Python dependencies
└── README.md                       # Project documentation
- Streamlit: Web framework for the user interface
- PyMuPDF (fitz): PDF text extraction
- python-docx: Microsoft Word document processing
- Sentence Transformers: AI-powered text analysis
- spaCy: Natural language processing and stop words removal
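As a rough illustration of how the extraction libraries listed above fit together, the sketch below dispatches on file extension. The function names (`extract_text`, `extract_pdf`, and so on) are hypothetical and are not taken from `FileHandling.py`; only the PyMuPDF and python-docx calls themselves are real library APIs.

```python
from pathlib import Path


def extract_pdf(path):
    # PyMuPDF is imported lazily so that TXT-only use needs no extra packages.
    import fitz
    with fitz.open(str(path)) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_docx(path):
    # python-docx exposes the document body as a list of paragraphs.
    from docx import Document
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def extract_txt(path):
    # Undecodable bytes are replaced rather than raising an error.
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Hypothetical dispatch table: one extractor per supported extension.
EXTRACTORS = {".pdf": extract_pdf, ".docx": extract_docx, ".txt": extract_txt}


def extract_text(path):
    """Return the raw text of a PDF, DOCX, or TXT file."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    return extractor(path)
```

A dispatch table keeps format support in one place, so adding a new format would only mean registering one more extractor.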
The application uses the all-MiniLM-L6-v2 model from Sentence Transformers to:
- Generate embeddings for input text
- Compare against predefined topic embeddings
- Calculate cosine similarity scores
- Provide confidence percentages for topic classification
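The comparison step above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code in `TopicClassifier.py`: the 3-dimensional vectors are toy stand-ins for real embeddings (in the application they would come from something like `SentenceTransformer("all-MiniLM-L6-v2").encode(text)`), and mapping the similarity to a percentage by clamping and multiplying by 100 is an assumed convention.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as unrelated
    return dot / (norm_a * norm_b)


def rank_topics(text_embedding, topic_embeddings):
    """Return (topic, similarity) pairs, most similar first."""
    scores = {
        topic: cosine_similarity(text_embedding, vector)
        for topic, vector in topic_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings for illustration only (assumed values, not model output).
topic_embeddings = {
    "Finance": [0.9, 0.1, 0.0],
    "Healthcare": [0.1, 0.9, 0.1],
}
text_embedding = [0.8, 0.2, 0.1]

ranked = rank_topics(text_embedding, topic_embeddings)
best_topic, best_score = ranked[0]

# Assumed convention: clamp negative similarities to 0, then scale to 0-100.
confidence = round(max(best_score, 0.0) * 100, 1)
print(best_topic, confidence)
```

With these toy vectors the text ranks closest to Finance, because its embedding points in nearly the same direction as the Finance vector.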
- Document Parsing: Extract raw text from uploaded files
- Text Preprocessing: Remove stop words and normalize text
- Keyword Analysis: Match against topic-specific keyword sets
- AI Classification: Use machine learning for intelligent topic detection
- Results Generation: Calculate relevance scores and generate insights
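The preprocessing and keyword-analysis steps of this pipeline might look roughly like the sketch below. Several things here are assumptions: the tiny `STOP_WORDS` set stands in for spaCy's stop-word list, the function names are hypothetical, and relevance is defined as the share of remaining tokens that are keywords, which may differ from the formula the application actually uses.

```python
import re

# Illustrative stop-word list only; the application relies on spaCy for this.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "for", "on"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def keyword_relevance(text, keywords):
    """Count keyword hits and express them as a percentage of all tokens."""
    tokens = preprocess(text)
    keyword_set = {keyword.lower() for keyword in keywords}
    matches = [token for token in tokens if token in keyword_set]
    # Assumed formula: matched tokens / total tokens after stop-word removal.
    relevance = 100.0 * len(matches) / len(tokens) if tokens else 0.0
    return {
        "total_words": len(tokens),
        "keyword_matches": len(matches),
        "relevance_percent": round(relevance, 2),
    }


result = keyword_relevance(
    "The bank raised interest rates to curb inflation in the economy.",
    ["bank", "interest", "inflation", "loan"],
)
print(result)
```

In this example 7 tokens remain after stop-word removal and 3 of them are keywords, giving a relevance of about 42.86%.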
The application supports analysis across 21+ topic categories:
- Technology: AI, Software, Cybersecurity
- Sciences: Healthcare, Environment, Science
- Business: Finance, Economy, Business
- Society: Education, Politics, Law, Culture
- Lifestyle: Sports, Travel, Food, Art
- Others: Media, Religion, Agriculture, Energy, Security
Run the test suite using pytest:
pytest
Test configuration is available in pytest.ini.
- streamlit>=1.47.0: Web application framework
- PyMuPDF: PDF processing
- python-docx: Word document processing
- sentence-transformers>=2.6.1: AI text analysis
- torch>=2.0.0: Machine learning backend
- spacy: Natural language processing

- pytest: Testing framework
- pytest-mock: Testing utilities
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: Visit the Help & Documentation page in the application
- Issues: Report bugs or request features via GitHub Issues
- Discussions: Join project discussions on GitHub
- Support for additional file formats (RTF, ODT)
- Batch processing capabilities
- Export results to various formats
- Custom topic creation
- Advanced visualization features
- REST API integration
- First-time loading may take longer due to AI model initialization
- Large documents (>10MB) may require additional processing time
- Recommended RAM: 4GB+ for optimal performance
Built with ❤️ using Python and Streamlit