
I need to identify all tags that contain the word ‘method’.

I wrote a Python script using requests and regex. It first reads a text file to extract the IDs, then uses requests to open each URL and identify the tags that contain the keyword ‘method’. However, the output is a series of empty lists.
Here is the code:

import requests
from bs4 import BeautifulSoup as bs
import re


def read_file():
    with open("C://Users//reshma.regi//PycharmProjects//Method_mining_from_jornals//test_.txt") as f:
        content = f.readlines()
        content = [x.strip() for x in content]
    for pmcid in content:
        r = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id='+pmcid+'=my_tool&[email protected]')
        soup = bs(r.content, 'lxml')
        pmc = soup.findAll(re.compile(r'(methods)'))
        print(pmc)

def main():
    read_file()

if __name__ == '__main__':
    main()

To test the code, you can use the following PMCIDs: [2150890, 2364767]

The desired output for PMCID: 2150890 is:

    <title>Materials and methods</title>
    <sec>
<title>Chromatin unfolding assay</title>
<p>
To construct the EGFP-lac-E2F1 and EGFP-lac-p53 fusion expression vectors, the PCR fragments that encode the E2F1 (aa 368–437) and p53 (aa 1–73), respectively, were cloned into the AscI site in the plasmid p3′SS d tb Cl EGFP AscI (NYE4) (A.C. Nye and A.S. Belmont, personal communication). The correct orientation of the inserts was identified by colony hybridization and confirmed by DNA sequencing. To construct the lac-BRCA1 plasmids, the sequence for lac repressor was first amplified by PCR from the plasmid NYE4. The lac sequence was cloned into the HindIII–NotI sites of pRC-CMV (Invitrogen), generating pRC-lac. Various BRCA1 fragments and the COBRA1 sequence were amplified by PCR and inserted into the unique AscI site of pRC-lac.
</p>
<p>
The chromatin unfolding experiments were performed as previously described (
<xref rid="bib43" ref-type="bibr">Tumbar et al., 1999</xref>
). Briefly, AO3_1 cells were transiently transfected with the lac expression vectors using the FuGENE 6 transfection reagent (Roche). The medium was changed 24 h after transfection and cells were immunostained 48 h after transfection. Cells grown on glass coverslips were fixed with 1.6% paraformaldehyde for 30 min in PBS, permeabilized with 0.2% Triton X-100 in PBS for 5 min, and blocked in 1% normal goat serum in PBS for 1 h. The coverslips were then incubated with primary antibodies at room temperature for 1 h, followed by incubation with the appropriate secondary antibodies for 1 h. Unless otherwise specified, a rabbit polyclonal anti–lac repressor antibody (Stratagene) and mouse monoclonal anti-FLAG antibody (Sigma-Aldrich) were applied at 1:20,000 dilution. The anti–acetylated histone H3 antibody was raised against di-acetylated H3 (Lys9 and Lys14) (
<xref rid="bib4" ref-type="bibr">Boggs et al., 1996</xref>
) (
<xref rid="bib20" ref-type="bibr">Lin et al., 1989</xref>
), a gift from Drs. C. Mizzen and C.D. Allis (University of Virginia, Charlottesville, VA). The secondary antibodies were goat anti–rabbit IgG-conjugated with Cy3 (Amersham), and horse anti–mouse IgG-conjugated with fluorescein isothiocyanate (FITC; Vector Laboratories).
</p>
<p>
For visualization of the nuclei, cells were stained with 0.2 μg/ml 4,6-diamidino-2-phenylindole (DAPI) for 5 min before mounting. Fluorescent images were acquired by a charged-coupled device camera (Hamamatsu ORCA) that was mounted on a Nikon Microphot-SA microscope and equipped with Improvision Openlab software. Confocal images were collected on a Zeiss LSM410 confocal microscope. Figs. were assembled using Adobe Photoshop (v. 5.5).
</p>
</sec>
<sec>
<title>Yeast two-hybrid screen</title>
<p>
To identify proteins that specifically interact with the BRCT1 repeat of BRCA1, the standard yeast two-hybrid screen was performed in the following manner. First, the bait plasmid was generated by inserting a PCR-amplified cDNA fragment encoding the BRCT1 sequence (aa 1642–1736) into the NdeI–EcoRI restriction sites of pAS2–1 (CLONTECH Laboratories, Inc.), resulting in an in-frame fusion with the GAL4 DNA-binding domain. The resultant plasmid, pAS2-BRCT1, and a human ovary cDNA prey library (CLONTECH Laboratories, Inc.) were sequentially transformed into the
<italic>S. cerevisiae</italic>
strain CG1945 according to the manufacturer's instructions (CLONTECH Laboratories, Inc.). Transformants were plated on synthetic medium lacking tryptophan, leucine and histidine but containing 1 mM 3-aminotriazole. Approximately 2.3 million transformants were screened. The candidate clones were retrieved from the yeast cells and reintroduced back to the same yeast strain to verify the interaction between the candidates and the BRCT1 bait. The specificity of the interaction was determined by comparing the interactions between the candidates and various bait constructs.
</p>
</sec>
<sec>
<title>Coimmunoprecipitation</title>
<p>
HEK293T cells were transfected using LipofectAmine 2000 (GIBCO BRL). 24 h after transfection, cells were washed twice with PBS and lysed in 0.5 ml lysis buffer (50 mM Hepes, pH 8, 250 mM NaCl, 0.1% NP-40, and protease inhibitor tablets from Roche). After brief sonication, the lysate was centrifuged at 16,000
<italic>g</italic>
for 12 min at 4°C. The supernatant was used for subsequent coimmunoprecipitation. 20 μl of the supernatant was used as crude extract for detecting protein expression level. 15 μl of a 50% slurry of the anti-FLAG agarose beads (Sigma-Aldrich) was used in each immunoprecipitation. Immunoprecipitation was performed overnight at 4°C. The beads were centrifuged at 3,300 rpm for 2 min, and washed three times with washing buffer (50 mM Hepes, pH8, 500 mM NaCl, 0.5% NP-40) and three times with RIPA buffer (50 mM Tris, pH 8.0, 150 mM NaCl, 1% NP-40, 0.1% SDS, and 0.5% sodium deoxycholate). Each wash was performed for at least 30 min. The precipitates were then eluted in 15 μl 2× SDS-PAGE sample buffer. Gel electrophoresis was followed by immunoblotting according to standard procedures.
</p>
</sec>
<sec>
<title>GST pulldown assay</title>
<p>
The PCR fragments encoding various BRCA1 fragments were cloned into pGEX-2T and the constructs were confirmed by sequencing. The GST-BRCA1 proteins were made and purified, with the induction of protein expression performed at 19°C overnight. pcDNA3 vector containing the COBRA1 gene was used for in vitro transcription and translation in the TnT Reticulocyte Lysate system (Promega). The
<sup>35</sup>
S-labeled COBRA1 was translated in vitro according to the manufacturer's instructions and mixed with 10 μg the GST-bound bead in 0.5 ml binding buffer (50 mM Tris-HCl, pH 7.5, 150 mM NaCl, 1 mM EDTA, 0.3 mM DTT, 0.1% NP-40 and protease inhibitor tablet). The binding reaction was performed at 4°C overnight and the beads were subsequently washed four times with washing buffer (same as binding buffer except 0.5% NP-40 was used), 30 min each time. The beads were eluted in 10 μl 2 × SDS-PAGE sample buffer and the proteins were resolved on 10% denaturing gel. The gel was then dried and exposed to x-ray films for overnight.
</p>
</sec>
</sec>

2 Answers


  1. as html

    It’s hard to know what the "right" thing to do with that document is,
    since it’s not exactly HTML.
    Oh, I see, the 2nd line explains that it’s XML conforming to nlm-articleset-2.0.dtd.
    There are XML parsers that may be a better fit than BS4,
    but in any event we’ll press onward.

    Suppose we munge it into something a little closer to well-formed HTML:

    # r.content is bytes in Python 3, so use r.text for str.replace()
    soup = bs(r.text.replace('<sec', '<div').replace(' sec-type=', ' class='), 'lxml')
    divs = soup.find_all('div')
    

    Then if we ask for all divs, divs[8] contains the desired content.

    This obtains just a single section,

    divs = soup.find_all('div', class_='materials|methods')
    

    so divs[0] has the content.

    Within a section you might find it helpful to query for <p> or <title> tags.

    as xml

    ElementTree

    BeautifulSoup is great for scraping browser web pages.
    But that’s not how this document is structured.
    Let’s use a different technique, which parses according to that structure.

    import xml.etree.ElementTree as et
    
    root = et.fromstring(r.content)
    for i, sec in enumerate(root.iter('sec')):
        if sec.attrib:
            print(i, sec.attrib)
    
    8 {'sec-type': 'materials|methods'}
    

    You can continue to parse out the pieces from there.
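
As a rough sketch of what "parsing out the pieces" might look like, the snippet below filters `<sec>` elements by their `sec-type` attribute and then reads the titles of the nested subsections. The `SAMPLE` string is a hypothetical, heavily trimmed stand-in for the efetch response, and the helper names `methods_sections` and `subsection_titles` are made up for illustration. Only the standard-library ElementTree calls are real.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for the efetch response.
SAMPLE = """<article><body>
<sec><title>Introduction</title><p>Background text.</p></sec>
<sec sec-type="materials|methods"><title>Materials and methods</title>
  <sec><title>Chromatin unfolding assay</title><p>First protocol.</p></sec>
  <sec><title>Yeast two-hybrid screen</title><p>Second protocol.</p></sec>
</sec>
</body></article>"""


def methods_sections(root):
    """Return every <sec> whose sec-type attribute mentions 'methods'."""
    return [sec for sec in root.iter("sec")
            if "methods" in sec.get("sec-type", "")]


def subsection_titles(section):
    """Return the titles of the direct child <sec> elements."""
    return [sub.findtext("title", default="") for sub in section.findall("sec")]


root = ET.fromstring(SAMPLE)
methods = methods_sections(root)
for sec in methods:
    print(sec.findtext("title"))
    for title in subsection_titles(sec):
        print("  -", title)
```

With the real response you would pass `r.content` to `ET.fromstring` instead of `SAMPLE`. Not every article necessarily sets `sec-type`, so treat the attribute filter as a first attempt rather than a guarantee.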

    xmltodict

    You might find that the simple API offered by xmltodict
    ($ pip install xmltodict) is a good fit for this project.

  2. I believe the following code produces output like the one you gave for PMCID 2150890:

        pmc = soup.find_all('title', string=re.compile(r'method'))
        for i in pmc:
            print(i.parent)
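
For comparison, here is a minimal sketch of the same idea (match on the title text, then take the enclosing element) using only the standard library instead of BeautifulSoup. ElementTree elements have no parent pointer, so the sketch builds a child-to-parent map first. `SAMPLE` is a hypothetical trimmed document, `sections_titled` is an illustrative helper name, and the case-insensitive match is an assumption that differs from the case-sensitive pattern above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical trimmed document; the real input would be r.content.
SAMPLE = """<article><body>
<sec><title>Results</title><p>Findings.</p></sec>
<sec><title>Materials and methods</title>
  <sec><title>GST pulldown assay</title><p>Protocol text.</p></sec>
</sec>
</body></article>"""


def sections_titled(root, pattern):
    """Return the parent element of every <title> whose text matches pattern."""
    # ElementTree has no .parent attribute, so map each child to its parent.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    regex = re.compile(pattern, re.IGNORECASE)  # assumption: ignore case
    return [parent_of[title] for title in root.iter("title")
            if title.text and regex.search(title.text)]


root = ET.fromstring(SAMPLE)
matches = sections_titled(root, r"method")
for sec in matches:
    # Serialize the whole matching section, nested subsections included.
    print(ET.tostring(sec, encoding="unicode"))
```

Matching on the title text rather than the tag name is the key point: a compiled regex passed as the first argument to `find_all` is tested against tag names, and no tag in this document is named `methods`.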
    